Abstract


The characteristic of an asynchronous multiprocessor is that it is composed of
several processors capable of carrying out the execution of their own programs in a
completely independent fashion. As a consequence, parallel algorithms for asynchronous
multiprocessors present some unique aspects in both their design and their analysis. This
thesis explores the issues raised by the design and the analysis of parallel algorithms for
asynchronous multiprocessors and illustrates the various notions and concepts involved
with these algorithms by considering problems in diverse areas. The thesis demonstrates
that asynchronous multiprocessors can be used efficiently in different problem domains,
provided that appropriate algorithms are used. It also illustrates various techniques
useful in the analysis of such algorithms.

As evidenced by a series of experimental results, the computation time required by
a process to execute several instances of the same task on an asynchronous multiprocessor
cannot be regarded as constant and is actually subject to important fluctuations. These
fluctuations in computation times have a negative effect on the performance of parallel
algorithms when several processes cooperating in the solution of a problem communicate
extensively among themselves. In this case, when synchronization is used, it tends to
introduce a prohibitive overhead which decreases the parallelism. On the other hand, an
algorithm is presented to illustrate that the fluctuations are not always a negative factor
but can also be utilized advantageously. The algorithm demonstrates the seemingly
counter-intuitive result that the execution of a purely sequential program can still be
accelerated on an asynchronous multiprocessor without introducing any parallelism within
the program itself, but only by taking advantage of the fluctuations in computation times.
Two different parallel implementations of this algorithm are proposed (with and without
critical sections) and analyses are presented to measure the speed-up achievable.

In the domain of numerical applications, the class of asynchronous iterative methods
is introduced to remove the need for synchronization in the implementation of iterations
for solving a system of equations on a multiprocessor. This class includes iterations
corresponding to parallel implementations in which the cooperating processes have a
minimum of inter-communication and do not make any use of synchronization. The Purely
Asynchronous method is a typical example. A sufficient condition is established which
guarantees the convergence of any asynchronous iteration. This condition is satisfied for
systems of equations found in numerous practical applications.

Several asynchronous iterations have actually been implemented on an asynchronous
multiprocessor. Experimental results are reported, and they show that the Purely
Asynchronous method achieves an almost optimal speed-up. The experiments constitute an
illustration of the various notions and concepts specific to the design and analysis of
parallel algorithms for asynchronous multiprocessors. It is also shown how simple
techniques drawn from order statistics and queueing theory can be used to predict the
experimental results with fair accuracy.

The α-β pruning algorithm serves as an example of a non-numerical application in
this thesis. The sequential algorithm is first analyzed, and it is shown that the branching
factor of the α-β pruning algorithm for a uniform game tree of degree n grows with n as
$\Theta(n / \ln n)$. This confirms a claim by Knuth and Moore that deep cut-offs only have a
second order effect on the behavior of the algorithm. The results obtained with the
sequential algorithm are then used to derive an efficient parallel implementation of the
α-β pruning algorithm on an asynchronous multiprocessor. An analysis of the parallel
implementation with k processes shows, rather surprisingly, an improvement over the
original algorithm by a factor larger than k.

Chapter I


Introduction 


1 - Introduction  and  motivation 


Parallel computers and multiprocessors offer a natural solution to the
ever-increasing demand for computing power. At the same time, their evolution has
brought about the need for the development of efficient parallel algorithms. This need is
now becoming more and more acute since recent advances in computer technology have
drastically reduced the cost of components, and it is quite conceivable that parallel
computers composed of 1000 or more processors will be built in the near future.

Parallelism is achievable in a variety of ways, as exemplified by the various
architectures of parallel computers already existing. Following Flynn's classification [21],
we mention below only a few among the more important ones. For a general overview,
Stone [57] offers an introductory presentation of parallel computer architecture; Kuck [36]
evaluates some parallel machine organizations in relation to their programming; and
Enslow [19] surveys specifically multiprocessor organization, which is of central interest
to us in this thesis.

The ILLIAC IV computer [5] is a typical example of an SIMD (Single Instruction
stream Multiple Data stream) machine [21]. Often referred to as an array processor, the
ILLIAC IV was designed explicitly for solving partial differential equations by the method
of finite differences (typically, for weather forecasting). It is composed of 64 identical
processing elements, organized as an 8x8 array, which execute synchronously the same
instruction possibly operating on different data. The CDC STAR-100 [29] and the Cray-1
computer [54] are also SIMD machines in Flynn's classification. They are often referred to
as vector computers, and they gain their efficiency by providing for vector-type
instructions, capable of executing in parallel the same operation on all elements of a
variable size vector rather than on a single scalar. Pipelined computers and associative
processors also belong to the class of SIMD machines; a general presentation of their
architectures can be found in [12] and [65], respectively.

This thesis is concerned with another type of parallel computer, classified by Flynn
as an MIMD (Multiple Instruction stream Multiple Data stream) machine [21]. Throughout
the thesis, this type of computer will be referred to as an asynchronous multiprocessor,
since we think this term better reflects the view we are taking here.

Examples of asynchronous multiprocessors include commercially available computers
like the UNIVAC 1108 bi-processor; special purpose computers like the D825 [1], produced
for command and control military applications; and research products like C.mmp [63] and
Cm* [59]. C.mmp and Cm* have been (and are being) built at Carnegie-Mellon University
using mini-processors, slightly modified versions of the DEC PDP-11 and the DEC LSI-11.
While C.mmp is truly a multiprocessor, in that each processor has direct access to each
memory bank through a cross-point switch, Cm* could also be considered as a local
network, in which intercommunication takes place between clusters (each processor,
however, can actually access the entire common memory through a sophisticated address
mechanism [30], [59]).

We do not intend to go into the details of the architecture of any asynchronous
multiprocessors. (See [19] for a general survey of the architectures of existing
multiprocessors.) For the purpose of the thesis, it is sufficient to consider an
asynchronous multiprocessor as composed of a set of independent processors sharing a
common memory, each processor being able to carry out the execution of its own program.
In this respect the execution of programs on an asynchronous multiprocessor, unlike on an
SIMD machine, is made in a completely asynchronous fashion and takes on a chaotic
appearance. This is especially true since the processors are not necessarily of the same
type, as is the case with C.mmp (composed of both PDP-11/20 and PDP-11/40), and could
actually have drastically different characteristics, particularly in speeds. Another reason
is that access to memory is not necessarily uniform, as is the case with Cm*. Notice that,
in this broad sense, a network of computers could be viewed as an asynchronous
multiprocessor as well since, in this case, the computers can still be considered to share a
common memory, although very indirectly. As a matter of fact, the algorithms that we
propose in this thesis for asynchronous multiprocessors are also well suited for
implementation over a network, especially if the time required for the intercommunication
between the computers is not too high compared to the time required by the computation
on each computer.

After this very brief presentation of parallel computer architecture, let us now turn
our attention to the issue of parallel algorithms. From an algorithmic point of view, SIMD
machines have been the most widely studied to date, and particularly the ILLIAC IV type of
computer. Due to its specific structure, the efficient utilization of an array processor
requires that a problem be decomposed into identical subtasks which communicate among
each other in some regular fashion, and the range of possible applications is, therefore,
limited (mainly to linear algebra oriented problems). Numerous examples of parallel
algorithms for SIMD machines in the area of numerical linear algebra can be found in a
recent survey by Heller [27]. Examples of non-numerical algorithms can be found, for
instance, in [9], [58], and [61].

Being composed of a set of independent processors, an asynchronous multiprocessor
allows for greater flexibility in its programming than does an SIMD machine. Although
asynchronous multiprocessors have now been in existence for several years (the D825 [1],
in fact, dates back to the early 60's), very little has been published so far on how to
design parallel algorithms that run efficiently on an asynchronous multiprocessor. Until
recently, emphasis in the design of parallel algorithms for multiprocessors has been
placed mainly on techniques for recognizing the intrinsic parallelism of existing sequential
algorithms rather than on the direct construction of parallel algorithms. Some of these
techniques have actually been implemented in a version of the Algol-68 compiler running
on Cm* [28]. Typically, the transformation of a sequential program is accomplished by
identifying independent subtasks within the program and introducing precedence relations
between them; a parallel program then can execute the various subtasks according to the
graph of the relations. However, a parallel program resulting directly from this automatic
transformation requires considerable communication and extensive synchronization to
control the flow of execution of the various subtasks. This ultimately reduces its
efficiency.

In the domain of numerical analysis, a different approach in designing algorithms for
asynchronous multiprocessors has proved to be more fruitful. Rather than adapting
existing sequential algorithms, Chazan and Miranker [11] have presented a class of
iterative methods for the solution of a linear system of equations which takes into account
the asynchronous nature of multiprocessors.

Essentially initiated by a recent paper by Kung [37], a systematic study is now
under way to explore some of the unique issues raised specifically by the design and the
analysis of parallel algorithms for asynchronous multiprocessors. This study certainly
benefits from extensive research done in a different, but related, area concerning
time-shared processors rather than true multiprocessors. However, results in the latter
area deal mostly with special problems typically encountered in time-sharing or
multiprogramming operating systems, e.g., resource allocation, co-ordination of
independent devices (typically, I/O devices), and they address directly the issue of
co-operation of processes without addressing general issues, such as problem
decomposition, involved with the design of multiprocessor algorithms. (See, for
example, [16] for an early presentation of this area, and [2] for some examples of typical
problems.)

In addition to [37], a few examples of typical algorithms for multiprocessors have
already appeared, and they illustrate several important notions unique in their design [6],
[38], [39], [40] and in their analysis [3], [4], [8], [51].

This thesis is concerned specifically with the design and the analysis of parallel
algorithms for asynchronous multiprocessors. In Section 2 of this chapter, we briefly
discuss the main issues involved in their designs. The remaining chapters of the thesis
study these issues in depth in several problem domains. These results are summarized in
Section 3 of this chapter.

2 - The  design  of  algorithms  for  asynchronous  multiprocessors 

Algorithms for SIMD machines and algorithms for asynchronous multiprocessors are
similar in principle, in that they both rely on the decomposition of a problem into subtasks
executed in parallel. This is, however, their only similarity, and these two types of
parallel algorithms in general present drastic differences with respect to both their design
and their analysis. Let us examine, in this section, some of the unique issues raised by
parallel algorithms for asynchronous multiprocessors.

Most of the problems associated with the design of parallel algorithms for
asynchronous multiprocessors have been clearly exposed by Kung [37]. Throughout the
thesis, we use the notions and concepts introduced in his paper, and, below, we briefly
review some of the more important ones. In particular, [37, p. 156]:

"We define a parallel algorithm for multiprocessors as a collection of
concurrent processes that may operate simultaneously for solving a
given problem."

It is important to distinguish between the notion of process, which corresponds to the
execution of a procedure or a piece of program, and the notion of processor, the physical
entity which carries out the execution of a process. While we have control over the
processes in the design of a parallel algorithm, we do not usually have control over the
processors, which are administered by the operating system. In particular, the same
process is not necessarily executed by only one processor during its entire lifetime, and,
upon decisions of the operating system, several processors might be assigned successively
to its execution. As an immediate consequence, the time required for the execution of a
process on an asynchronous multiprocessor can fluctuate in an almost unpredictable way.
There are, in fact, numerous reasons contributing to this unpredictable behavior: we
already mentioned the fact that the different processors of an asynchronous
multiprocessor might have different speeds and that the access to memory is not
necessarily uniform; several other features of an asynchronous multiprocessor or of its
environment which also contribute to the fluctuations in the execution time of a process
are listed in [37].

Communication is very likely to be required among the processes co-operating in
the solution of a problem. Kung [37] regards a process as a sequence of stages defined
between two consecutive interaction points at which the process communicates with other
processes. Parallel algorithms for multiprocessors are then classified according to the
way in which communication is accomplished. In a synchronized parallel algorithm (or,
simply, a synchronized algorithm) processes explicitly use synchronization primitives, and,
upon completion of a stage, a process may have to wait for the results of other processes
before resuming its execution; a producer-consumer type of program is a typical example
of a synchronized algorithm. In an asynchronous parallel algorithm (or, simply, an
asynchronous algorithm) the processes communicate among themselves only through the
use of global variables (possibly updated within a critical section), and, at the completion
of a stage, a process either terminates or proceeds further, without any delay, according
to the current contents of the global variables. Examples of asynchronous algorithms are
presented in the following chapters.
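The notion of an asynchronous algorithm can be made concrete with a small sketch. The Python fragment below is only an illustration under stated assumptions, not a program taken from the thesis: the task (finding a minimum), the function name `run_asynchronous`, and the use of threads to stand in for processes are all hypothetical choices. Each loop iteration plays the role of a stage; a process reads the global variable, updates it inside a critical section when needed, and proceeds at once without waiting for the results of any other process.

```python
import threading


def run_asynchronous(values, num_processes):
    """Return min(values) using processes that never wait for each other's results."""
    best = [float("inf")]      # global variable shared by all processes
    lock = threading.Lock()    # protects the critical section below

    def process(chunk):
        for v in chunk:        # one stage per element
            # Interaction point: consult the current contents of the global.
            if v < best[0]:
                with lock:     # critical section: update the global variable
                    if v < best[0]:
                        best[0] = v
            # The process proceeds to its next stage without any delay.

    # Hypothetical decomposition: process i handles every num_processes-th value.
    chunks = [values[i::num_processes] for i in range(num_processes)]
    threads = [threading.Thread(target=process, args=(c,)) for c in chunks]
    for th in threads:
        th.start()
    for th in threads:
        th.join()              # only collects termination, not a per-stage barrier
    return best[0]


result = run_asynchronous([5, 3, 8, 1, 9, 2], num_processes=3)
```

A synchronized version of the same computation would instead place a barrier after every stage, so that the slowest process would set the pace for all the others.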

Let us now address briefly (and informally) the issues of correctness and of
efficiency, both of which, we feel, should always be dealt with in the design of any
algorithms. These issues are not the only ones which should be taken into account, but, in
the case of parallel algorithms for asynchronous multiprocessors, these two issues become
particularly interesting and important because of the a priori unpredictable behavior in
the execution of those algorithms. For this very reason, however, we can anticipate that
proving the correctness and analyzing the efficiency of an algorithm for a multiprocessor
are, in general, difficult tasks.

2.1  - Correctness 

Correctness is obviously a requirement for any algorithm. Considerable research
has been done on the proof of correctness of sequential programs, and a detailed
treatment of some of the techniques available can be found, for example, in Dijkstra's
recent text [17]. These techniques, however, are mostly applicable to sequential programs
with a simple structure (with no complicated data structures, for instance), and their
generalization to parallel programs (especially asynchronous parallel programs) is still
quite limited.

An early paper by Dijkstra [16] contains the first major statement on the proof of
correctness of parallel programs. Research in this area has been restricted mostly to
proving the correctness of the solutions of small problems, which could be used for the
implementation of some mechanisms in larger parallel programs (e.g., the readers and
writers problem [13], or the producer-consumer scheme [26]). Several attempts have
been made only very recently to extend some of the techniques to the proof of
correctness of complete and more complex parallel programs [47], [20].

Despite the lack of a formal theory, we still feel that we have given with every
algorithm presented in this thesis a convincing argument that it performs correctly. This
proof of correctness can take on very different aspects. In Chapter II, for example, we give
a proof of the correctness of a parallel program by verifying that global variables used in
the program satisfy some property which holds during the entire execution of the program;
this is achieved by checking the possible transitions of the global variables before and
after interaction points. In some respect, the proof resembles more, in this case, the
formal proof of a sequential program using assertions and invariants; this is partly due to
the simple structure of the particular parallel program we are dealing with. In Chapter III,
on the other hand, the proof of the correctness (and of the termination) of the algorithm
follows directly from the theorem of convergence which is derived through techniques of
numerical analysis.

2.2  - Efficiency 

In the design of any algorithm, efficiency is always an important issue. Since one of
the primary goals in the design of a parallel algorithm is to achieve better efficiency than
with a sequential algorithm, this issue must be considered very seriously in the case of an
algorithm for an asynchronous multiprocessor.

We would like to illustrate below that, because of the fluctuations in the execution
times on an asynchronous multiprocessor, synchronized algorithms will generally show a
very poor performance. This is true for several reasons. The execution of the
synchronization primitives themselves is often very time consuming (a typical execution
time for these primitives is on the order of a couple of hundred additions).
Also, and most importantly, the use of synchronization implies the blocking of the
processes co-operating in a task, and, in turn, either causes some of the processors to be
idle or entails the switching of contexts. In both cases, the use of synchronization may
reduce the parallelism and decrease the speed-up that we hope to achieve by using an
asynchronous multiprocessor.

To illustrate this point, let us consider Jacobi's method to solve the linear system of
equations given by:

$$x = A x + b,$$

where $A$ is an $n \times n$ matrix, and $b$ and $x$ are $n$-vectors. Let $x_0$ be an initial
approximation to the solution of this system. Jacobi's method consists of computing the
sequence of iterates $x_i$, for $i = 1, 2, \ldots$, through the recurrence:

$$x_i = A x_{i-1} + b.$$
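As a minimal sketch of this recurrence, the following Python fragment iterates $x_i = A x_{i-1} + b$ on a small system. The 2x2 matrix, the vector, the number of steps and the function names are illustrative assumptions, chosen so that the iteration converges; none of them come from the thesis. Each component of the new iterate depends only on the previous iterate, which is the property that makes the method attractive for parallel execution.

```python
def jacobi_step(A, b, x):
    """Compute the next iterate A x + b; each component is independent of the others."""
    n = len(b)
    return [sum(A[r][c] * x[c] for c in range(n)) + b[r] for r in range(n)]


def jacobi(A, b, x0, steps):
    """Apply the recurrence x_i = A x_{i-1} + b for a fixed number of steps."""
    x = list(x0)
    for _ in range(steps):
        x = jacobi_step(A, b, x)
    return x


# Hypothetical example system, chosen so that the iteration is a contraction.
A = [[0.0, 0.5],
     [0.25, 0.0]]
b = [1.0, 1.0]
x = jacobi(A, b, x0=[0.0, 0.0], steps=60)
```

For this example the fixed point of $x = Ax + b$ is $(12/7, 10/7)$, which can be checked by substitution.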

This method is well suited for parallel computation since, at each step of the iteration, the
computations of all components can be carried out in parallel. For example, assuming that
n processors are available, a natural way to decompose the computation of a new iterate
is to assign to each of the n processors the computation of one of the n components of the
iterate. This implementation requires, however, that at the end of each step all processes
be synchronized before they can start the computation of the next iterate. In case all
processes take exactly the same amount of time to compute a component, the overhead
introduced by the synchronization is reduced to the execution time of the synchronization
primitives themselves. However, it follows from the discussion at the beginning of the
section that it is more realistic to assume that the time taken by a process to compute a
component is a random variable rather than a constant. In this case the time it takes to
compute the whole set of components of a new iterate is given by the maximum of n
random variables. In particular, to give an idea, assume that the time for the computation
of any component is distributed according to the same exponential distribution with mean
t; then, simple calculus shows that the mean computing time for obtaining a new iterate is
given by $H_n t$, where $H_n = 1 + \frac{1}{2} + \cdots + \frac{1}{n}$ is the n-th harmonic
number. The coefficient $H_n$ represents the penalty imposed by the synchronization.
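The harmonic-number penalty can be checked numerically. The sketch below is an illustration only: it assumes independent exponential computation times, as in the paragraph above, and the function names, trial count and seed are arbitrary choices. It estimates the mean duration of one synchronized step, that is, the mean of the maximum of n exponential variables, and sets it beside $H_n t$.

```python
import random


def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def mean_step_time(n, t, trials, seed=0):
    """Estimate the mean time of one synchronized step with n processes.

    Each process takes an exponential time with mean t; the step ends
    only when the slowest process has finished.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0 / t) for _ in range(n))
    return total / trials


n, t = 8, 1.0
estimate = mean_step_time(n, t, trials=20000)
predicted = harmonic(n) * t
print(f"simulated mean step time: {estimate:.3f}, H_n * t: {predicted:.3f}")
```

With n = 8 the predicted step time is already more than two and a half times the mean time of a single component, even though all eight components are computed in parallel.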

This simple example shows that the apparent parallelism in Jacobi's method for
solving linear systems of equations is considerably reduced by the fact that this method
implicitly requires synchronization at each step of the computation. In fact, it can be
shown that the proportion of time wasted by the processes (while they are idle, waiting
for the completion of the last computation) is given by:

$$\frac{\frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}}{H_n} = 1 - \frac{1}{H_n} \approx 1 - \frac{1}{\ln n},$$

and tends to 1 as n tends to infinity, which means that the processes are almost always idle,
waiting for each other!
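A short calculation shows how quickly the idle share grows. The sketch assumes the exponential model of this section, under which one step occupies all n processors for a mean time $H_n t$ while only $n t$ of the total $n H_n t$ processor time is useful work, so the wasted share is $1 - 1/H_n$. The function names are illustrative.

```python
def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def idle_fraction(n):
    """Expected share of processor time spent waiting at the synchronization point.

    Assumes n processes with independent, identically distributed
    exponential computation times (the model used in the text).
    """
    return 1.0 - 1.0 / harmonic(n)


for n in (2, 16, 1024):
    print(n, round(idle_fraction(n), 3))
```

The fraction approaches 1 only logarithmically, but it is already substantial for small n: with two processes, one third of the processor time is spent waiting.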


This example also shows that, when programming an asynchronous multiprocessor,
the problem of the fluctuations in the execution times requires much attention, and that
synchronization should be used very carefully. In particular, the design of parallel
programs for asynchronous multiprocessors should take into account the fact that the
various processors execute their programs independently and possibly at very different
speeds, and that, therefore, communication among the processes co-operating in a task
should be reduced to a strict minimum.

3 - Thesis  overview 

This thesis explores the issues raised by the design and the analysis of parallel
algorithms for asynchronous multiprocessors. The various notions and concepts involved
with these algorithms are illustrated by considering very diverse problem areas for
numerical as well as non-numerical applications. The thesis demonstrates, in particular,
that asynchronous multiprocessors can be used very effectively in different problem
domains, provided that appropriate algorithms are used. The thesis also illustrates
various techniques useful in the analysis of such algorithms. The remaining chapters are
briefly summarized below.

We have just shown, in Section 2.2, that the fluctuations in the execution times of
programs that are run on an asynchronous multiprocessor could cause a very important
degradation in the performance of synchronized algorithms, even for a problem which is, a
priori, well suited for parallel implementation. In Chapter II, we show that we have the
reverse phenomenon with asynchronous algorithms, even for a purely sequential problem.
Namely, given a sequence of tasks to be performed serially, we propose an asynchronous
algorithm to accelerate the execution of the tasks on an asynchronous multiprocessor
without introducing parallelism within the tasks but only by taking advantage of
fluctuations in the execution times. We give a parallel program requiring no critical
section to implement the algorithm, and we prove its correctness. We also give a
spacewise more efficient implementation, which requires the use of critical sections. We
then present an analysis for both implementations to estimate the speed-up achievable
with the parallel algorithm, and we show that, when the execution times are exponentially
distributed and no critical section is used, the algorithm with k processes yields a
speed-up of order $\sqrt{k}$.

In Chapter III, we introduce the class of asynchronous iterative methods for solving a
(linear or non-linear) system of equations. We identify existing iterative methods in terms
of asynchronous iterations, and we propose new schemes corresponding to a purely
asynchronous algorithm (with no synchronization between the co-operating processes).
We give a sufficient condition (satisfied in most practical applications) to guarantee the
convergence of any asynchronous iterations and extend the results to include
asynchronous iterative methods with memory. We then evaluate asynchronous iterative
methods from a computational point of view: we derive bounds for the efficiency and
briefly compare the bounds with experimental results (see Chapter V).

Chapter IV deals with the α-β pruning algorithm. In the first part of Chapter IV, we
analyze the sequential α-β pruning algorithm, using the number of terminal nodes
examined by the algorithm as the cost measure. The analysis takes into account both
shallow and deep cut-offs, and we also consider the possibility of ties between terminal
positions; specifically, we assume that all bottom values are independent identically
distributed random variables drawn from a discrete probability distribution. We show that
the worst case of the algorithm can be achieved even when only two distinct values are
assigned to the terminal nodes, and we deduce that the branching factor of the α-β
pruning algorithm in a uniform game tree of degree n grows with n as $\Theta(n / \ln n)$,
therefore confirming a claim by Knuth and Moore [35] that deep cut-offs only have a
second order effect on the behavior of the algorithm.
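To make the cost measure concrete, the sketch below runs a standard textbook formulation of α-β pruning on a tiny uniform game tree and counts the terminal nodes it examines. It is an illustration only: the tree, the nested-list representation, and the function names are assumptions, and the code is not the formulation analyzed in Chapter IV.

```python
def alphabeta(node, alpha, beta, maximizing, counter):
    """Alpha-beta search; counter[0] counts the terminal nodes examined."""
    if not isinstance(node, list):      # terminal node: a bottom value
        counter[0] += 1
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, counter))
            alpha = max(alpha, value)
            if alpha >= beta:           # cut-off: remaining children are skipped
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True, counter))
        beta = min(beta, value)
        if alpha >= beta:               # cut-off
            break
    return value


def minimax(node, maximizing):
    """Exhaustive evaluation, used only as a reference value."""
    if not isinstance(node, list):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


# Hypothetical uniform game tree of degree 3 and depth 2, given by its bottom values.
tree = [[3, 5, 2], [1, 8, 4], [6, 7, 9]]
counter = [0]
value = alphabeta(tree, float("-inf"), float("inf"), True, counter)
print("value:", value, "terminal nodes examined:", counter[0])
```

On this tree the search returns the same value as exhaustive minimax while skipping two of the nine terminal nodes, because the second subtree is cut off after its first bottom value.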

In the second part of Chapter IV, we propose a parallel implementation of the
α-β pruning algorithm requiring very little communication between the processes. In the
parallel scheme, the processes work independently by searching for the solution of the
game tree over disjoint subintervals. We develop an analysis of the parallel algorithm,
from which it follows that the parallel implementation with k processes shows an
improvement over the sequential α-β pruning algorithm by a factor larger than k for k = 2
or 3. This leads to the rather surprising discovery that the sequential α-β pruning
algorithm is not optimal.
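One possible reading of "disjoint subintervals", assumed here for illustration and not confirmed by this chapter, is that each process runs an ordinary α-β search restricted to its own (alpha, beta) window, and that the process whose window contains the true value of the game tree obtains it exactly. The Python sketch below follows that reading; the tree, the cut points, the use of a thread pool, and all names are hypothetical. The cut points are half-integers so that an integer-valued tree never falls on a window boundary.

```python
from concurrent.futures import ThreadPoolExecutor


def alphabeta(node, alpha, beta, maximizing):
    """Alpha-beta search restricted to the window (alpha, beta)."""
    if not isinstance(node, list):      # terminal node
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value


def parallel_search(tree, cuts):
    """Search disjoint windows independently; keep results strictly inside a window."""
    bounds = [float("-inf")] + list(cuts) + [float("inf")]
    windows = list(zip(bounds[:-1], bounds[1:]))

    def search(window):
        low, high = window
        v = alphabeta(tree, low, high, True)
        # A result strictly inside the window is the exact value of the tree;
        # otherwise it is only a bound and is discarded.
        return v if low < v < high else None

    with ThreadPoolExecutor(max_workers=len(windows)) as pool:
        results = list(pool.map(search, windows))
    return [v for v in results if v is not None]


tree = [[3, 5, 2], [1, 8, 4], [6, 7, 9]]
answers = parallel_search(tree, cuts=[4.5])   # two processes, two windows
```

The searches share no data and need no synchronization while they run, which matches the "very little communication" emphasized in the text; narrower windows also tend to produce more cut-offs in each individual search.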

In Chapter V, we present the results of measurements performed by running several
asynchronous iterations (introduced in Chapter III) on C.mmp [63], an asynchronous
multiprocessor at Carnegie-Mellon University. These experiments have proved to be an
invaluable tool for providing us with some insight into the behavior of parallel algorithms,
and, in particular, they constitute a clear illustration of the advantage of purely
asynchronous algorithms over synchronized algorithms.

In Chapter VI, we show how the classical tools of queueing theory can be applied to
the analysis of the performance of parallel algorithms for asynchronous multiprocessors,
and, in particular, we develop a simple queueing model to account for the behavior of a
parallel program which uses critical sections. We then compare the analytical results
derived from the model with the experimental results presented in Chapter V, and the
comparison shows an excellent agreement.


In  the  last  chapter,  we  summarize  the  principal  results  of  the  thesis,  mention  some 
possible  extensions  and  give  some  concluding  remarks.  We  also  present  some  topics  for 
Chapter II

Parallel  Execution  of  a Sequence  of  Tasks 
on  an  Asynchronous  Multiprocessor 


1 - Introduction 

We are interested in the design and analysis of parallel algorithms for asynchronous
multiprocessors such as C.mmp [63] or Cm* [59]. For any given task, the task execution
time on such a system is dependent upon the properties of the operating system, effects
of other users, processor-memory interference, and many other factors. As a result, it is
necessary to assume that task execution times are random variables rather than constants.
(See Chapter V for experimental results supporting this assumption.) In this chapter we
propose a novel way of using asynchronous multiprocessors, which takes advantage of
fluctuations in task execution times. We will present our result as a solution to the
problem of executing a sequence of n tasks $w_1, \ldots, w_n$ under the following conditions:

C1. For $i = 2, \ldots, n$, task $w_i$ cannot be started before the completion of task
$w_{i-1}$ (i.e., the tasks are linearly ordered).

C2. For $i = 1, \ldots, n$, no parallelism can be utilized in the execution of task $w_i$
(i.e., we are not allowed to decompose a task).

C3. The execution time of a task is a random variable rather than a constant.
(This condition corresponds to the asynchronous nature of the multiprocessor.)

We will view a parallel algorithm for asynchronous multiprocessors as a collection
of asynchronous processes which communicate with each other through the use of global
variables. Such an algorithm will be defined by giving the procedure each of its
processes executes when assigned to a processor. While analyzing the algorithm, we will
always assume that a processor is available for any of the runnable processes of the
algorithm. (See Kung [37] for a general discussion of asynchronous parallel algorithms.)

In Section 2 we give an algorithm which uses k asynchronous processes to solve
the problem. The algorithm is interesting because at most one process is doing useful
work at any given time. Nevertheless, by taking advantage of condition C3, the mean
execution time is less for k > 1 than for k = 1, i.e., a speed-up is achieved.

As an example, consider the computation of $x_1, \ldots, x_n$ defined by

$$x_{i+1} = \varphi(x_i, x_{i-1}, \ldots, x_{i-d}),$$

where $x_0, x_{-1}, \ldots, x_{-d}$ are given and $\varphi$ is some iteration function. Let $w_{i+1}$ be the task of
computing $x_{i+1}$. Our algorithm could be used to execute tasks $w_1, \ldots, w_n$, which is
equivalent to evaluating $x_1, \ldots, x_n$.

The speed-up ratio of a parallel algorithm using k processes is defined in
Section 3, and some preliminary results are proved there. In Section 4 we give programs
to implement our algorithm both with and without critical sections and prove informally
their correctness. In Section 5 we consider the implementation without critical sections,
and obtain an analytic expression for the speed-up under certain assumptions (A1 and A2
of Section 5). For large n and k, our result is $\approx \sqrt{2k/\pi}$. In Section 6 we consider the
implementation which uses critical sections. Here the analysis is more difficult, and we
can obtain analytic results only for k = 2. Some conclusions and open problems are stated
in Section 7.

2 - The  algorithm 

For each positive integer k, we define an algorithm with k processes for executing
tasks $w_1, \ldots, w_n$ under conditions C1 and C2 stated in the preceding section. The algorithm
is specified as follows:

Whenever a process, P, is ready to execute a task,

(i) if no task has yet been completed by any process, process P starts executing
task $w_1$,

(ii) otherwise, if the last task has not yet been completed by any process,
process P starts executing a task which is unfinished and ready for execution.

For simplicity, we will assume that no two tasks are completed at the same time. Then,
due to the linear ordering of the tasks, condition (ii) defines without ambiguity a unique
task to be executed by process P.

Let $t_1, t_2, t_3, \ldots$, with $t_1 < t_2 < t_3 < \cdots$, be the times of task completion by the processes. The
diagram of Figure 2.1 illustrates a possible scheduling of the tasks when they are
executed by the algorithm with three processes.
Figure 2.1 - A possible task scheduling with three processes


Note that, when process $P_3$ finishes task $w_3$ at time $t_8$, process $P_2$ has already completed
task $w_4$. Thus, after $P_3$ completes $w_3$, it starts executing $w_5$ rather than $w_4$. Task $w_4$ is
skipped by $P_3$. Similarly, tasks $w_6$ and $w_7$ are skipped by $P_1$, and task $w_2$ is among those skipped by $P_2$.
After  any  one  of  the  three  processes  has  executed  six  tasks,  tasks  Wf  through  Wg  rather 
than  tasks  u/j  through  wg  are  completed.  A speed-up  has  been  achieved! 

Observe that at any given time at most one process is doing work useful for later
computation. With respect to the scheduling given by Figure 2.1, the time intervals on
which processes are doing useful computations are indicated in Figure 2.2.
Figure 2.2 - Time intervals on which processes are doing useful work

Thus the speed-up is not achieved by sharing work among the processes, but is
achieved by taking advantage of fluctuations in the execution times.
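
The scheduling rule of this section can be explored numerically. The following Python sketch is a minimal event-driven simulation under stated assumptions: task times are exponentially distributed with mean 1, and all processes start the first task at time 0. The names `run_once` and `estimate_speedup` are hypothetical helpers introduced only for this illustration.

```python
import heapq
import random


def run_once(n, k, rng, mean=1.0):
    """Simulate one run; return the time of the first completion of task n."""
    done = 0  # index of the last task completed by any process (0 = none yet)
    # One pending event per process: (finish time, process id, task being executed).
    events = [(rng.expovariate(1.0 / mean), p, 1) for p in range(k)]
    heapq.heapify(events)
    while True:
        t, p, task = heapq.heappop(events)
        if task == done + 1:
            done = task  # first completion of this task: useful work
            if done == n:
                return t
        # Useful or not, the process now starts the unique unfinished, ready task.
        heapq.heappush(events, (t + rng.expovariate(1.0 / mean), p, done + 1))


def estimate_speedup(n, k, trials=2000, seed=0):
    """Monte Carlo estimate of (mean time with 1 process) / (mean time with k processes)."""
    rng = random.Random(seed)
    t1 = sum(run_once(n, 1, rng) for _ in range(trials)) / trials
    tk = sum(run_once(n, k, rng) for _ in range(trials)) / trials
    return t1 / tk
```

For instance, `estimate_speedup(50, 3)` yields a ratio noticeably greater than 1, even though at most one process is doing useful work at any time.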

3 - A speed-up  measure 

Consider the algorithm with k processes as specified in the preceding section. The
algorithm is said to be the sequential algorithm if k = 1 and to be a parallel algorithm if
k > 1. Let $T_k(n)$ be the time to execute tasks $w_1, \ldots, w_n$ by the algorithm with k processes.
Let $\bar{T}_k(n)$ be the mean of the random variable $T_k(n)$. We define the speed-up ratio of the
algorithm with k processes to be

$$S_k(n) = \bar{T}_1(n) / \bar{T}_k(n).$$

For each k and for each execution of the algorithm with k processes, we define $s_{k,i}$
to be the time of the first completion of task $w_i$, and define $s_{k,0} = 0$. For example, with
respect to the scheduling of Figure 2.1, with k = 3, we have:

$$s_{3,1} = t_1, \quad s_{3,2} = t_2, \quad s_{3,3} = t_5, \quad s_{3,4} = t_7, \quad s_{3,5} = t_9, \quad s_{3,6} = t_{12}, \quad s_{3,7} = t_{13}.$$

The following theorem describes the relation between $\{s_{k,i}\}$ and $\{t_r\}$ in terms of the
scheduling of the tasks. This theorem is important in Sections 5 and 6 for computing
speed-up ratios.
Theorem 3.1:

Suppose that $s_{k,i} = t_r$, $1 \le i < n$. Then $s_{k,i+1} = t_{r+j}$ for some $1 \le j \le k$ if
and only if

(a) the j processes completing tasks at times $t_r, \ldots, t_{r+j-1}$ are all distinct, and

(b) the process completing task $w_{i+1}$ at time $t_{r+j}$ is one of the j processes
mentioned in (a).

Proof:

We will only prove the necessary condition since the proof for the sufficient
condition is similar.

Suppose that some process P completes two tasks at times $t_{r+h}$ and $t_{r+m}$ for
$0 \le h < m \le j-1$. Then, since at time $t_{r+h}$ task $w_i$ has already been completed, the task
completed at time $t_{r+m}$ by process P must be $w_{i+1}$. This contradicts the fact that $w_{i+1}$ is
completed for the first time at time $t_{r+j}$, since $t_{r+m} < t_{r+j}$. This proves (a).

Let P be the process completing task $w_{i+1}$, for the first time, at time $t_{r+j}$. Suppose
that P does not complete any task in the interval $[t_r, t_{r+j-1}]$. Then the task completed by
P at time $t_{r+j}$ must be started before time $t_r$. But at any time before $t_r$ task $w_i$ is not
completed yet. Hence any task started before time $t_r$ cannot be $w_{i+1}$. In particular, the
task completed by P at time $t_{r+j}$ cannot be $w_{i+1}$. This contradiction proves (b). ■

For $i = 1, \ldots, n$, let $v_k(i)$ be the random variable representing the quantity
$s_{k,i} - s_{k,i-1}$. Since $T_k(n) = s_{k,n}$, we have

$$T_k(n) = v_k(1) + v_k(2) + \cdots + v_k(n). \tag{3.1}$$

Equation (3.1) will be used later to compute $\bar{T}_k(n)$, which is needed for evaluating the
speed-up ratio $S_k(n)$.

4 - Parallel  programs  for  the  algorithm  and  their  correctness 

We  give  two  programs  to  implement  the  algorithm  with  k processes:  one  without 
critical  sections  and  one  with  critical  sections. 
4.1 - A program without critical sections

Program A:

    global integer (or real) array U[1:n];
    global boolean array M[1:n+1];

    Initialization:
    begin
        for m := 1 to n+1 do M[m] := false;
        start processes P_1, ..., P_k
    end

    Process P_j:
    begin integer m_j;
        m_j := 1;
        while M[m_j] do m_j := m_j + 1;                    (4.1)
        while m_j ≤ n do                                    (4.2)
        begin
            perform task w_{m_j};                           (4.3)
            write the output of task w_{m_j} on U[m_j];     (4.4)
            M[m_j] := true;                                 (4.5)
            while M[m_j] do m_j := m_j + 1                  (4.6)
        end
    end

Assume that the tasks are not allowed to alter the array M and the integers $m_j$. We will
prove that Program A is correct in the following sense:

P1. For $m = 2, \ldots, n$, task $w_m$ is executed only if task $w_{m-1}$ has been finished and
its output has been written on U[m-1].

P2. For $j = 1, \ldots, k$, process $P_j$ can execute the loops at (4.1), (4.2) and (4.6) at
most n times.

P3. All the tasks $w_1, \ldots, w_n$ will have been completed at the time when any one of
the processes $P_1, \ldots, P_k$ terminates its execution.

Property P2 guarantees that the program will terminate. (Note that there is no
possibility of deadlocks in the program.) Property P1 ensures that the linear ordering
requirement of the executions of the tasks is maintained, and property P3 implies that
when the program terminates all the tasks are completed.


Lemma 4.1:

(i) For $m = 1, \ldots, n$, if M[m] is set to true, it remains true afterwards.

(ii) After being initialized to false, M[n+1] is never modified.

Proof:

After initialization, M can only be modified through statement (4.5) executed by
some process $P_j$. Since (4.5) only assigns the value true, (i) follows. Moreover, when
entering the main while-loop (starting with statement (4.2)), $m_j$ satisfies the condition
$m_j \le n$ and is not modified before execution of (4.5). Therefore M[n+1] can never be
modified. ■


Lemma 4.2:

For $j = 1, \ldots, k$, if $m_j$ has the value $m \ge 2$, then M[m-1] is true.

Proof:

Suppose that $m_j = m$ with $m \ge 2$ at time t. If $m_j$ was incremented by 1 to the value
m inside the while statement (4.1) or (4.6), then the test of M[$m_j$] being true with
$m_j = m-1$ must have been satisfied. Hence M[m-1] was true at some time before t. Thus,
by Lemma 4.1, M[m-1] is true at time t. ■


Lemma 4.3:

For $m = 2, \ldots, n$, if M[m] is true, then M[m-1] is true.

Proof:

Suppose that M[m] is true. Then M[m] must have been assigned to true through
instruction (4.5) by some process $P_j$ with $m_j$ having the value m. Therefore, by
Lemma 4.2, M[m-1] is true. ■
Lemma 4.4:

For $m = 1, \ldots, n$, if M[m] is true, then task $w_m$ is completed and its output is on
U[m].

Proof:

Suppose that M[m] is true. Then M[m] must have been assigned to true through
instruction (4.5) by some process $P_j$ with $m_j$ having the value m. Since $P_j$ executes
instruction (4.5) only after the completion of task $w_{m_j}$, and since $m_j$ is not modified in
between, we conclude that task $w_m$ is completed. ■

We are now able to prove the following theorem.

Theorem 4.1:

Program A satisfies properties P1, P2 and P3.

Proof:

Suppose that process $P_j$ is executing task $w_m$ with $m = m_j \ge 2$. Then, by
Lemma 4.2, M[m-1] is true, and hence, by Lemma 4.4, task $w_{m-1}$ is completed and its
output is on U[m-1]. We conclude that Program A satisfies property P1.

Property P2 follows from statement (ii) of Lemma 4.1 since $m_j$ is incremented by 1
in each execution of a loop.

Suppose that a process, say process $P_j$, terminates. This happens only when
$m_j = n+1$. Thus, by Lemma 4.2, M[n] is true and hence, by Lemma 4.3, M[m] is true for
all $m = 1, \ldots, n$. Therefore, by Lemma 4.4, all tasks are completed. We have shown that
Program A also satisfies property P3. ■

Program A is very reliable in the following sense. Property P3 implies that, even if
some processes fail (for reasons external to the algorithm: e.g., crash of the processors
executing the processes), the program may still continue executing tasks and eventually
complete all tasks, provided that there remains at least one active process. We will not
pursue this reliability issue any further, though we believe it is important.
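
As an illustration of Program A, the following is a minimal Python sketch using threads, applied to the iteration $x_m = \varphi(x_{m-1})$. It is a sketch under assumptions, not the original implementation: the function name `run_program_a`, the use of index 0 of `U` to hold the initial value, and the reliance on CPython's thread semantics (the write to `U[m]` is visible before the write to `M[m]`) are all choices made here for the example. Comments refer to statements (4.1) to (4.6).

```python
import threading


def run_program_a(n, k, phi, x0):
    """Run k threads following Program A; return the list [x_1, ..., x_n]."""
    U = [None] * (n + 1)   # U[1..n] holds task outputs; U[0] holds x0 for convenience
    U[0] = x0
    M = [False] * (n + 2)  # M[1..n+1]; M[n+1] is never set and stops the scans

    def process():
        m = 1
        while M[m]:                # (4.1) skip tasks already completed
            m += 1
        while m <= n:              # (4.2)
            y = phi(U[m - 1])      # (4.3) perform task w_m
            U[m] = y               # (4.4) write its output
            M[m] = True            # (4.5) announce completion
            while M[m]:            # (4.6) move to the next unfinished task
                m += 1

    threads = [threading.Thread(target=process) for _ in range(k)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return U[1:]
```

Several threads may write the same `U[m]`, but since the task is deterministic they write the same value, which is consistent with property P1.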
4.2 - A program with critical sections


For problems where we are only interested in the output of the last task $w_n$, the
use of the global arrays U[1:n] and M[1:n+1] in Program A can be avoided at the expense
of using critical sections.

We will illustrate the idea with the following example. Consider the problem of
generating the n-th iterate $x_n$ by $x_{i+1} = \varphi(x_i)$, given the initial iterate $x_0$. Suppose that
we use Program A. Then, corresponding to the global array U[1:n], we have the global
array x[0:n] where x[i] keeps the value of the i-th iterate, and instructions (4.3) and (4.4)
become

    x[m_j] := φ(x[m_j - 1]) .

Note that we only need x[n]. The use of the array x[0:n] is wasteful in space, and might
even be impractical (e.g., when n is large or when the elements x[0], ..., x[n] are
themselves vectors or complicated structures). The following program eliminates this
problem.

Program B:

    global integer m; global real x;

    Initialization:
    begin
        m := 1; x := x_0;
        start processes P_1, ..., P_k
    end

It is crucial to assume that the statements enclosed within a pair of curly brackets
(lines (4.7), (4.8) and (4.9)) are programmed as critical sections. (As a matter of fact, the
two lines (4.8) and (4.9) can be programmed as one critical section.) With this assumption
it is possible to prove the correctness of the above program. The proof is based on the
observation that the global variable m is a non-decreasing function of time which takes on
all integer values between 1 and n+1. The proof is relatively easy and hence is omitted
here.

Note that, as was already mentioned, x and $y_j$ may represent a large amount of data.
Hence the execution of x := $y_j$ or $y_j$ := x may take a significant amount of time. After
presenting, in Section 5, an analysis for programs which do not have critical sections, we
will give, in Section 6, an analysis for programs which do have critical sections.
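
The idea of replacing the global arrays by a single pair (m, x) protected by critical sections can be sketched as follows in Python. This is only one hypothetical reading consistent with the description above, not a transcription of Program B: it assumes that each process copies x into a local $y_j$ inside a critical section, computes outside it, and then, inside a second critical section, publishes its result only if m has not advanced in the meantime. The name `run_program_b` and the use of a single `threading.Lock` are assumptions of this sketch.

```python
import threading


def run_program_b(n, k, phi, x0):
    """Return the n-th iterate x_n using k threads and O(1) shared storage."""
    shared = {"m": 1, "x": x0}   # the global variables m and x
    lock = threading.Lock()      # guards every critical section

    def process():
        with lock:               # critical section: copy (m, x) into locals
            m_j, y_j = shared["m"], shared["x"]
        while m_j <= n:
            y_j = phi(y_j)       # perform task w_{m_j} outside any critical section
            with lock:           # critical section: publish or resynchronise
                if shared["m"] == m_j:       # nobody completed w_{m_j} before us
                    shared["x"] = y_j
                    shared["m"] = m_j + 1
                m_j, y_j = shared["m"], shared["x"]

    threads = [threading.Thread(target=process) for _ in range(k)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return shared["x"]
```

In this sketch the shared variable m only ever increases by 1, from 1 to n+1, which mirrors the observation on which the correctness argument is based.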

5 - Speed-up  ratios:  Implementations  without  critical  sections 

Let $t_{i,j}$ be the random variable representing the time to execute task $w_i$ by process
$P_j$. In this and the next section, we assume that the $t_{i,j}$, for $i = 1, \ldots, n$ and $j = 1, \ldots, k$, are
independent and identically distributed. The assumption is reasonable when all tasks are
of the same complexity and executed by identical processors. We will use T to denote any
of the random variables $t_{i,j}$, and use $\tau$ to denote the mean of T.
It is easy to obtain $\bar{T}_1(n)$. By equation (3.1) with k = 1, we have:

$$T_1(n) = v_1(1) + v_1(2) + \cdots + v_1(n).$$

Since, in this case, the $v_1(i)$ are independent and identically distributed with mean $\tau$, we
deduce that

$$\bar{T}_1(n) = n\tau. \tag{5.1}$$

In the rest of the chapter, in order to evaluate $\bar{T}_k(n)$, we impose the following
further assumptions:

A1. All processes start at the same time t = 0. (I.e., at t = 0 all the k processes start
with the execution of task $w_1$.)

A2. The random variable T is exponentially distributed with mean $\tau$.

We observe that by the independence of the $t_{i,j}$ and by assumption A2 the
quantities $v_k(i)$, $i = 1, \ldots, n$, are independent random variables. It follows, from
equation (3.1) and assumption A2, that

$$\bar{T}_k(n) = \bar{v}_k(1) + \bar{v}_k(2) + \cdots + \bar{v}_k(n), \tag{5.2}$$

where $\bar{v}_k(i)$ is the mean of $v_k(i)$.

In addition, by assumption A1, $v_k(1)$ is given by the minimum of k random variables
distributed as T. Since T is exponentially distributed, the minimum has the mean:

$$\bar{v}_k(1) = \tau/k. \tag{5.3}$$

We now consider $v_k(i+1)$ for $i = 1, \ldots, n-1$. Define the probability distribution $p_{k,j}$,
$j = 1, 2, \ldots$, as follows. (We use here the same notation as in Section 3.) Let $p_{k,j}$ be the
probability that $s_{k,i+1} = t_{r+j}$ given that $s_{k,i} = t_r$ for some r. Hence for $j \le k$, $p_{k,j}$ is
the probability that conditions (a) and (b) of Theorem 3.1 hold. Using the same argument
as used in the proof of Theorem 3.1, it is easy to show that $p_{k,j} = 0$ if $j > k$. In addition,
assumption A2 implies that, from the memory-less property of the exponential
distribution, $p_{k,j}$ is independent of i and r. We have:

$$v_k(i+1) = \begin{cases} (t_{r+1} - t_r) & \text{with probability } p_{k,1}, \\ (t_{r+1} - t_r) + (t_{r+2} - t_{r+1}) & \text{with probability } p_{k,2}, \\ \quad \vdots \\ (t_{r+1} - t_r) + \cdots + (t_{r+k} - t_{r+k-1}) & \text{with probability } p_{k,k}. \end{cases} \tag{5.4}$$

Since by assumption A2 the random variables $t_{r+1} - t_r$, $r = 1, 2, \ldots$, are independent (and
identically distributed) random variables with mean $\tau/k$, we derive from equation (5.4) that,
for $i = 1, \ldots, n-1$, the mean of $v_k(i+1)$ is given by:

$$\bar{v}_k(i+1) = \sum_{1 \le j \le k} j \, \frac{\tau}{k} \, p_{k,j} = \frac{\tau}{k} \sum_{1 \le j \le k} j \, p_{k,j}. \tag{5.5}$$

By equations (5.2), (5.3) and (5.5), we obtain that

$$\bar{T}_k(n) = \frac{\tau}{k} \Bigl( 1 + (n-1) \sum_{1 \le j \le k} j \, p_{k,j} \Bigr). \tag{5.6}$$

To evaluate $\bar{T}_k(n)$, we need to know the following quantity:

$$N_k = \sum_{1 \le j \le k} j \, p_{k,j}. \tag{5.7}$$

Lemma 5.1:

For $j = 1, \ldots, k$:

$$p_{k,j} = \frac{j \cdot k!}{k^{j+1} (k-j)!}.$$

Proof:

We first observe that, by assumption A2, for $r = 1, 2, \ldots$, any one of the k processes
is equally likely to complete a task at time $t_r$. Suppose that $s_{k,i} = t_r$ and $s_{k,i+1} = t_{r+j}$.
Then, by condition (a) of Theorem 3.1, the j processes completing tasks at times
$t_r, \ldots, t_{r+j-1}$ are different. This occurs with probability

$$\frac{k}{k} \times \frac{k-1}{k} \times \cdots \times \frac{k-j+1}{k} = \frac{k!}{k^j (k-j)!}. \tag{5.8}$$

Moreover, by condition (b) of Theorem 3.1, the process completing a task at time $t_{r+j}$ must
be one of the j processes mentioned above. This occurs with probability j/k. Hence the
probability that $s_{k,i} = t_r$ and $s_{k,i+1} = t_{r+j}$ is

$$\frac{j}{k} \times \frac{k!}{k^j (k-j)!}. \qquad ■$$

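The problem of computing the leading terms in the asymptotic series for $N_k$ is
rather difficult. Fortunately, some known results can be used here. Define

$$Q_k = \sum_{1 \le j \le k} \frac{k!}{k^j (k-j)!}.$$

We are now able to establish the following.

Lemma 5.2:

$$N_k = Q_k.$$

Proof:

We have

$$N_k = \sum_{1 \le j \le k} \frac{j^2 \, k!}{k^{j+1} (k-j)!} = \sum_{1 \le j \le k} \frac{j \, k!}{k^j (k-j)!} - \sum_{1 \le j < k} \frac{j \, k!}{k^{j+1} (k-j-1)!} = Q_k,$$

where the first equality uses Lemma 5.1, the second uses $j^2/k = j - j(k-j)/k$, and the last
follows by shifting the index of the second sum, which leaves exactly one copy of each
term $k!/(k^j (k-j)!)$. ■

The leading terms in the asymptotic series for $Q_k$ are known [34, p. 118]:

$$Q_k = \sqrt{\frac{\pi k}{2}} - \frac{1}{3} + \frac{1}{12} \sqrt{\frac{\pi}{2k}} + O\!\left(\frac{1}{k}\right).$$

Hence, by equations (5.1), (5.6) and Lemma 5.2, we have the following theorem.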


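Theorem 5.1:

Using k processes, the speed-up ratio is given by

$$S_k(n) = \frac{n k}{1 + (n-1) N_k},$$

where

$$N_k = \sqrt{\frac{\pi k}{2}} - \frac{1}{3} + \frac{1}{12} \sqrt{\frac{\pi}{2k}} + O\!\left(\frac{1}{k}\right).$$

Asymptotically, when both n and k are large, we obtain:

$$S_k(n) \approx \sqrt{\frac{2k}{\pi}} \approx 0.798 \sqrt{k}.$$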

6 - Speed-up  ratios:  Implementations  with  critical  sections 


In this section, we analyze speed-up ratios achievable by the algorithms when they
are implemented with critical sections.

Figure 6.1 - A possible task scheduling with two processes

In the diagram, the marks '-|-' and '-o-' indicate the sequences of time instants,
for $i = 1, 2, \ldots$, when a process completes a task and when the same process completes the
subsequent critical section. Since, at any time, only one process can execute the critical
section, a process may have to wait before entering the critical section. The periods of
waiting times are also marked in the diagram, as are the time instants $t_i$ when processes
actually enter the critical section.

As in the preceding section, we assume that the time a process takes to execute a
task is a random variable independent of the process and of the task. Let F be its
distribution function, and f its density function. Similarly, we assume that the time a
process takes to execute the critical section is a random variable independent of the
process. Let B be its distribution function and b its density function. Furthermore, let $\tau$
and $\beta$ denote the average execution times for a task and for the critical section,
respectively.

In the following we derive a general formula for evaluating the speed-up ratio
achievable by the parallel algorithm with two processes for the case when F is an
exponential distribution function and B is a general distribution function.


Observe that at time $t_i$ when a process enters the critical section, the second
process is necessarily performing some task (possibly just starting a task). Since the
distribution function F is exponential, at time $t_i$ the remaining execution time for the task
performed by the second process is distributed according to the same distribution function
F. Therefore the evolution of the processes, from time $t_i$ on, is independent of the past
for any distribution B. In particular, the random variables $t_{i+1} - t_i$, for $i = 1, 2, \ldots$, are
independent and identically distributed, and the same holds for the random variables
$v_2(i)$, for $i = 1, 2, \ldots$, defined in Section 3.

In this section, let $T_1(n)$ and $T_2(n)$ denote the time to complete task $w_n$ and the
subsequent critical section by the sequential algorithm and the parallel algorithm with two
processes, respectively. Let $\bar{T}_1(n)$ and $\bar{T}_2(n)$ denote their means. It follows from the
above discussion that, for k = 1 and 2, we have:

$$\bar{T}_k(n) = \bar{v}_k(1) + \bar{v}_k(2) + \cdots + \bar{v}_k(n) + \beta, \tag{6.1}$$

where the last term, $\beta$, accounts for the time to execute the last critical section (after the
completion of task $w_n$).

Consider first the sequential algorithm. In this case, we simply have $\bar{v}_1(1) = \tau$, and,
for $i = 2, \ldots, n$, $\bar{v}_1(i) = \beta + \tau$. Therefore, by equation (6.1):

$$\bar{T}_1(n) = n(\tau + \beta). \tag{6.2}$$

(Here we ignore the fact that in the sequential algorithm the critical section can be
shortened, since there is no need to include synchronization primitives.)

Consider now the parallel algorithm. As with equation (5.3), we have:

$$\bar{v}_2(1) = \tau/2. \tag{6.3}$$

For j = 1 and 2, let $p_j$ be the probability that $s_{2,i+1} = t_{r+j}$ given that $s_{2,i} = t_r$ for
some r. As in Section 5, by Theorem 3.1, we obtain, for $i = 1, \ldots, n-1$,

$$v_2(i+1) = \begin{cases} (t_{r+1} - t_r) & \text{with probability } p_1, \\ (t_{r+1} - t_r) + (t_{r+2} - t_{r+1}) & \text{with probability } p_2. \end{cases} \tag{6.4}$$

We have already mentioned that the random variables $t_{r+1} - t_r$, $r = 1, 2, \ldots$, are independent
and identically distributed. Let $\mu$ denote their mean. It follows from equation (6.4) that
the mean of $v_2(i+1)$ is given by:

$$\bar{v}_2(i+1) = \mu p_1 + 2 \mu p_2 = (2 - p_1) \mu, \tag{6.5}$$

since $p_1 + p_2 = 1$.

The following lemma establishes the values of $\mu$ and $p_1$.

Lemma 6.1:

Let $B^*$ denote the Laplace transform of the distribution function B, that is,
$B^*(s) = \int_0^\infty e^{-st} \, dB(t)$. We have:

$$\mu = \beta + \frac{\tau}{2} B^*(1/\tau), \tag{6.6}$$

$$p_1 = \frac{1}{2} B^*(1/\tau). \tag{6.7}$$

Proof:

We consider transitions for passing from time $t_i$ to time $t_{i+1}$. Up to a permutation of
the processes, there are three possible transitions $A_1$, $A_2$ and $A_3$, as defined by the
following diagrams:

[Diagrams of the transitions $A_1$, $A_2$ and $A_3$ between times $t_i$ and $t_{i+1}$]

where the notation of Figure 6.1 is assumed.

Let $H_j(t)$, j = 1, 2, and 3, be the probability that transition $A_j$ takes place and that
$t_{i+1} - t_i \le t$. We have:

$$H_1(t) = \int_0^t [1 - F(x)] \int_0^x b(y) f(x-y) \, dy \, dx,$$

$$H_2(t) = \int_0^t f(x) \int_0^x b(y) [1 - F(x-y)] \, dy \, dx,$$

$$H_3(t) = \int_0^t b(x) F(x) \, dx.$$

But we observe that $H(t) = H_1(t) + H_2(t) + H_3(t)$ is the distribution function for $t_{i+1} - t_i$ and
that the same process enters the critical section at both times $t_i$ and $t_{i+1}$ only with
transition $A_1$. Hence:

$$\mu = \int_0^\infty t \, dH(t) = \int_0^\infty [1 - H(t)] \, dt,$$

$$p_1 = \int_0^\infty [1 - F(x)] \int_0^x b(y) f(x-y) \, dy \, dx,$$

from which equations (6.6) and (6.7) follow easily. ■
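
The two quantities of the lemma can be cross-checked by a small Monte Carlo experiment. The sketch below is an illustration under assumptions: it takes the task times to be exponential with mean `tau`, reads the lemma as $\mu = \beta + (\tau/2) B^*(1/\tau)$ and $p_1 = \tfrac12 B^*(1/\tau)$ with $B^*(s)$ the expectation of $e^{-sY}$ for a critical-section duration Y, and samples one transition between two successive entries into the critical section. The function `simulate_transitions` and its parameter `sample_b` are hypothetical names.

```python
import random


def simulate_transitions(tau, sample_b, trials=200_000, seed=0):
    """Estimate (mu, p1) for two processes with exponential task times of mean tau."""
    rng = random.Random(seed)
    total_gap = 0.0
    same_process = 0
    for _ in range(trials):
        y = sample_b(rng)                    # duration of the critical section
        x = rng.expovariate(1.0 / tau)       # residual task time of the other process
        if x <= y:
            gap = y                          # the other process was already waiting
        else:
            # After the critical section both processes run memoryless tasks.
            mine = rng.expovariate(1.0 / tau)
            other = rng.expovariate(1.0 / tau)
            gap = y + min(mine, other)
            if mine < other:
                same_process += 1            # the same process enters again
        total_gap += gap
    return total_gap / trials, same_process / trials
```

With an exponential critical section of mean 0.5 and `tau = 1`, for example, `sample_b = lambda rng: rng.expovariate(2.0)` gives estimates that can be compared with the closed forms, using $B^*(1/\tau) = 1/(1 + \beta/\tau)$ in that case.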


By  collecting  the  preceding  results,  we  obtain  the  following  theorem. 
Theorem 6.1:

The speed-up ratio of the parallel algorithm with two processes is given by:

$$S_2(n) = \frac{n(\tau + \beta)}{(n-1) \left[ 2 - \tfrac{1}{2} B^*(1/\tau) \right] \left[ \beta + \tfrac{\tau}{2} B^*(1/\tau) \right] + \tfrac{\tau}{2} + \beta}.$$

We give below $B^*(s)$ for some distribution functions B.

(i) B is exponential (with parameter $1/\beta$):

$$B^*(s) = \frac{1}{1 + \beta s}.$$

(ii) B is uniform over [a, b]:

$$B^*(s) = \frac{e^{-as} - e^{-bs}}{(b-a) s}.$$

(iii) B is the Dirac function at the point $\beta$:

$$B^*(s) = e^{-\beta s}.$$

In Figure 6.2, we have plotted the asymptotic speed-up ratio $S_2$ as a function of the
ratio $\alpha = \tau/(\tau + \beta)$ for the three distributions mentioned above (in the second case, a and b
have been chosen as $\beta/2$ and $3\beta/2$, respectively).

When $\alpha$ tends to 0 (or $\beta$ tends to infinity), the algorithm approaches its worst case
performance, since the evaluations of the two processes tend to be exactly interleaved.
When $\alpha = 1$ (or $\beta = 0$), the critical section is non-existent and we have the results of
Section 5.

We observe from Figure 6.2 that the best speed-up ratio is always obtained when B
is an exponential distribution (the first case). We also note that the results obtained for
the two other cases are very close to each other and close to the results obtained with
the exponential distribution. This suggests that the results obtained with the exponential
distribution could be used as approximations to results obtained with other distributions.


Figure 6.2 - Speed-up ratio with 2 processes for various distributions B (horizontal axis: ratio $\alpha$)


We can observe from Figure 6.2 that, unlike the implementation without critical
section, better speed-up is not necessarily achieved by using more processes, though we
assume that a processor is always available to each process! More precisely, the figure
indicates that (when B is an exponential distribution) in order to achieve the best
speed-up when two processors are available, one should create two processes when
$\alpha > 0.586$, but only one process when $\alpha \le 0.586$. Similar results are useful in practice,
since they can be used to determine the optimal number of processes to create in order to
minimize the overall execution time.
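
The break-even value of the ratio can be reproduced numerically. The following sketch assumes the large-n limit of the two-process speed-up in the form $S_2 = (\tau+\beta) / \bigl([2 - \tfrac12 B^*(1/\tau)][\beta + \tfrac{\tau}{2} B^*(1/\tau)]\bigr)$, normalises $\tau + \beta = 1$ so that $\tau = \alpha$, and uses the standard transforms of the exponential, uniform (over $[\beta/2, 3\beta/2]$) and Dirac distributions. All function names are hypothetical.

```python
from math import exp, sqrt


def bstar_exponential(s, beta):
    return 1.0 / (1.0 + beta * s)


def bstar_uniform(s, beta):
    a, b = beta / 2, 3 * beta / 2          # interval [beta/2, 3*beta/2]; needs beta > 0
    return (exp(-a * s) - exp(-b * s)) / ((b - a) * s)


def bstar_dirac(s, beta):
    return exp(-beta * s)


def asymptotic_speedup(alpha, bstar):
    """Limit of the two-process speed-up for large n, with tau + beta normalised to 1."""
    tau, beta = alpha, 1.0 - alpha
    b = bstar(1.0 / tau, beta)
    return 1.0 / ((2.0 - b / 2.0) * (beta + tau * b / 2.0))


def break_even(bstar, lo=0.3, hi=0.9, steps=60):
    """Bisection for the alpha at which the asymptotic speed-up equals 1."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if asymptotic_speedup(mid, bstar) < 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For the exponential case, `break_even(bstar_exponential)` returns approximately 0.586 (in closed form $2 - \sqrt{2}$), and `asymptotic_speedup(1.0, bstar_exponential)` gives 4/3, the value obtained without critical sections for two processes.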

7 - Conclusions  and  open  problems 

In recent years, research in parallel algorithms has dealt mostly with synchronized
array or vector processors such as the ILLIAC IV or the CDC STAR, and there are very few
results on the design and analysis of algorithms for asynchronous multiprocessors. In this
chapter, we have proposed a novel method of using asynchronous multiprocessors which
takes advantage of their asynchronous behavior. We have also presented analytic
techniques to evaluate the performance of an asynchronous algorithm using the method.
The algorithm is expected to achieve a large speed-up when the fluctuations in the task
execution times are relatively large. Moreover, as noted in Section 4, the algorithm has a
nice reliability property. The same idea may also be used to construct other reliable
algorithms.

For the implementation with critical sections we obtained analytic results for two
processes. The results show that the parallel algorithm using two processes is not
necessarily faster than the sequential algorithm, because of the critical section overheads
associated with the parallel algorithm. This confirms the practical experience that the
speed-up ratio does not necessarily increase as the number of processes increases. It
would be interesting to extend our analytic results to more than two processes. We have
chosen to deal with a simple problem by imposing the condition that the tasks are linearly
ordered. An interesting extension would be to consider a set of tasks (possibly generated
dynamically) which are ordered by a directed graph (i.e., partially rather than linearly
ordered). Another interesting extension would be to design algorithms where the
execution of a task by a process may be interrupted by another process. We expect that
this approach would result in more efficient algorithms, since processes which are not
doing useful work can be interrupted. A careful performance analysis including the
additional overheads introduced by the interruption mechanism is needed here. This
problem has been addressed in two recent papers by Barak and Downey [3] and [4].

Finally, we note that the results of this chapter are not restricted to multiprocessor
systems. The ideas can be used to solve any problem in Operations Research which
satisfies conditions similar to C1, C2 and C3.


Chapter  III 


Asynchronous  Iterative  Methods 
for  Multiprocessors^ 


1 - Introduction

In this chapter we investigate the fixed point problem for an operator $F$ from
$\mathbb{R}^n$ into itself: we want to find a vector $x$ in $\mathbb{R}^n$ which satisfies
the system of equations represented by

    $x = F(x)$.    (1.1)

In [11], Chazan and Miranker introduced the chaotic relaxation scheme, a class of
iterative methods for solving equation (1.1) where $F$ is a linear operator given by
$F(x) = Ax + b$. They showed that iterations defined by a chaotic relaxation scheme
converge to the solution of equation (1.1) if and only if $\rho(|A|) < 1$. (If $M$ is a real
$n \times n$ matrix, $\rho(M)$ denotes its spectral radius and $|M|$ denotes the non-negative
$n \times n$ matrix obtained by replacing the elements of $M$ by their absolute values.)

In [41] and [43], Miellou generalized the chaotic relaxation scheme to include
non-linear  operators  and  obtained  convergence  results  similar  to  those  of  [11]  in  the  case 
of  contracting  operators  (see,  for  example,  [46,  p.  433]). 

In [11], [41] and [43], the motivation for defining chaotic relaxation is to account for
the parallel implementation of iterative methods on a multiprocessor system so as to
reduce communication and synchronization between the cooperating processes. This
reduction is obtained by not forcing the processes to follow a predetermined sequence of
computations, but simply by allowing a process, when starting the evaluation of a new
iterate, to choose dynamically not only the components to be evaluated but also the values
of the previous iterates used in the evaluation.

^ Copyright 1978, Association for Computing Machinery, Inc., reprinted by permission.
This chapter appeared in Journal of the ACM, Vol. 25, No. 2, April 1978, pp. 226-244.

The chaotic relaxation scheme does not, however, allow for a completely arbitrary
choice of the antecedent values used in the evaluation of an iterate. A restriction is that
there must exist a fixed positive integer $s$ such that, in carrying out the evaluation of the
$i$-th iterate, a process cannot make use of any value of the components of the $j$-th iterate
if $j < i - s$. We will show that this condition can be replaced by a more general one, which
still guarantees the convergence of the iteration.

In the next section we introduce the class of asynchronous iterative methods which
relaxes the assumption mentioned above, and we show that existing iterative methods (and,
in particular, the chaotic relaxation) can be represented as special cases of asynchronous
iterations. Section 3 gives the definition and reviews some properties of contracting
operators. Then the theorem of Section 4 generalizes the sufficient condition on the
convergence of the chaotic relaxation obtained by Chazan and Miranker [11] and by
Miellou [41] and [43]. This result is further extended, in Section 5, to include iterative
methods with memory. In Section 6, we consider the complexity of asynchronous iterative
methods, and we derive bounds on the efficiency. These bounds are then compared with
actual measurements of asynchronous iterations. The experimental results, presented in
Section 7, show a considerable advantage for iterations making no use of synchronization.
Section 8 is devoted to the study of an asynchronous iteration showing super-linear
convergence and, through a specific analysis, we give lower bounds on the order of
convergence and on the efficiency. Possible extensions of the results are discussed in
Section 9, and concluding remarks are presented in the last section.
2 - The class of asynchronous iterative methods

The following notations will be used throughout the chapter. If $x$ is a vector of
$\mathbb{R}^n$, its components will be denoted by $x_i$, $i = 1, \ldots, n$. To avoid
confusion, a sequence of vectors of $\mathbb{R}^n$ will be denoted by $x(j)$,
$j = 0, 1, \ldots$. If $F$ is an operator of $\mathbb{R}^n$ into itself, $F(x)$ will also be
represented in components by $f_i(x)$ or by $f_i(x_1, \ldots, x_n)$, $i = 1, \ldots, n$. We
denote by $\mathbb{N}$ the set of all non-negative integers.


2.1 - Definition of asynchronous iterative methods

The definition of chaotic iteration is originally due to Chazan and Miranker [11], and
the definition we give below for asynchronous iteration is similar to their definition.

Definition 2.1:

Let $F$ be an operator from $\mathbb{R}^n$ to $\mathbb{R}^n$. An asynchronous iteration
corresponding to the operator $F$ and starting with a given vector $x(0)$ is a sequence
$x(j)$, $j = 0, 1, \ldots$, of vectors of $\mathbb{R}^n$ defined recursively by:

    $x_i(j) = \begin{cases} x_i(j-1) & \text{if } i \notin J_j, \\ f_i\bigl(x_1(s_1(j)), \ldots, x_n(s_n(j))\bigr) & \text{if } i \in J_j, \end{cases}$    (2.1)

where $\mathcal{J} = \{ J_j \mid j = 1, 2, \ldots \}$ is a sequence of non-empty subsets of
$\{1, \ldots, n\}$ and $\mathcal{S} = \{ (s_1(j), \ldots, s_n(j)) \mid j = 1, 2, \ldots \}$ is a
sequence of elements in $\mathbb{N}^n$.

In addition, $\mathcal{J}$ and $\mathcal{S}$ are subject to the following conditions:
for each $i = 1, \ldots, n$,

(a) $s_i(j) \le j - 1$, $j = 1, 2, \ldots$;

(b) $s_i(j)$, considered as a function of $j$, tends to infinity as $j$ tends to infinity;

(c) $i$ occurs infinitely often in the sets $J_j$, $j = 1, 2, \ldots$.

An asynchronous iteration corresponding to $F$, starting with $x(0)$ and defined by
$\mathcal{J}$ and $\mathcal{S}$ will be denoted by $(F, x(0), \mathcal{J}, \mathcal{S})$. ■
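Definition 2.1 can be read almost directly as a program. The following is a minimal sketch, not the thesis's implementation: the function name `async_iteration`, the 0-based component indices, the 2x2 linear operator and the random bounded delays are all assumptions chosen for illustration. It stores every iterate so that component $i$ can be read from iterate $s_i(j)$.

```python
import random

def async_iteration(f, x0, J, s, steps):
    """Run `steps` steps of the asynchronous iteration (F, x(0), J, S).

    f(i, z) returns the i-th component f_i(z) (0-based i),
    J(j)    returns the set J_j of components updated at step j,
    s(i, j) returns the index s_i(j) <= j - 1 of the iterate whose
            i-th component is read at step j.
    """
    n = len(x0)
    history = [list(x0)]                      # history[j] is the iterate x(j)
    for j in range(1, steps + 1):
        # Assemble z = (x_1(s_1(j)), ..., x_n(s_n(j))), checking condition (a).
        z = []
        for i in range(n):
            si = s(i, j)
            assert 0 <= si <= j - 1, "condition (a) violated"
            z.append(history[si][i])
        x = list(history[j - 1])              # components outside J_j are copied
        for i in J(j):
            x[i] = f(i, z)
        history.append(x)
    return history

# Illustrative operator F(x) = Ax + b with rho(|A|) < 1 (an assumption).
A = [[0.0, 0.4], [-0.3, 0.2]]
b = [1.0, 2.0]

def f(i, z):
    return sum(A[i][k] * z[k] for k in range(2)) + b[i]

rng = random.Random(0)                        # fixed seed: reproducible delays

def J(j):
    return {(j - 1) % 2}                      # each component recurs: condition (c)

def s(i, j):
    return max(0, j - 1 - rng.randint(0, 5))  # bounded delay, so condition (b) holds

history = async_iteration(f, [0.0, 0.0], J, s, steps=400)
x_final = history[-1]
residual = max(abs(f(i, x_final) - x_final[i]) for i in range(2))
print(x_final, residual)
```

Keeping the whole history is wasteful but makes the roles of $\mathcal{J}$ and $\mathcal{S}$ explicit; a practical code would keep only the values that can still be read.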
In the definition of chaotic iterations, Chazan and Miranker [11] use the following
condition

(b') there exists a fixed integer $s$ such that $j - s_i(j) \le s$ for $j = 1, 2, \ldots$ and $i = 1, \ldots, n$,

in lieu of condition (b). Clearly, condition (b') implies condition (b), and, in this sense,
asynchronous iterations provide a generalization of chaotic relaxations.

An asynchronous iteration $(F, x(0), \mathcal{J}, \mathcal{S})$ may be thought of as
corresponding to the following sequence of computations on an asynchronous multiprocessor.

Assume we have a pool of processors available. Let $t_j$, $j = 1, 2, \ldots$, be an increasing
sequence of time instants. At time $t_j$, a processor $P$ is idle and is assigned to the
evaluation of the iterate $x(j)$. $x(j)$ differs from $x(j-1)$ by the set of components
$\{ x_i \mid i \in J_j \}$, and $P$ starts computing those components using values of
components known from previous iterates, namely the $r$-th component of the $s_r(j)$-th
iterate, for $r = 1, \ldots, n$. The choice of the components may be guided by any
criterion, and, in particular, a natural criterion is to pick up the most recently available
values of the components. This scheme does not require any synchronization between the
processes. At some time $t_k$ later on ($k > j$), $P$ will finish its computations and will be
assigned to a new evaluation: $x(k)$.

The use of asynchronous iterative methods is obviously not restricted to
multiprocessor systems, and the scheme is also well suited for execution on a network of
computers, in particular, when the communication between elements of the network is not
too expensive as opposed to the computation itself.

We notice that, in the evaluation of an iterate, nothing is imposed on the use of the
values of the previous iterates. The only thing required, by condition (b) of the definition,
is that, eventually, the values of an early iterate cannot be used any more in further
evaluations, and more and more recent values of the components have to be used instead.
On a multiprocessor, this condition can be satisfied as long as no processor crashes (that
is, each processor eventually completes its computation).
Condition (a) of the definition states the fact that only components of previous
iterates can be used in the evaluation of a new iterate. Condition (c) guarantees that no
component is abandoned forever.

2.2 - Examples and particular cases of asynchronous iterations

Classical iterative methods (point or block Jacobi, Gauss-Seidel, etc.), as well as
others introduced more recently (chaotic relaxation scheme [11], periodic chaotic
scheme [18], iteration chaotique a retards [41] and [43], iteration chaotique
serie-parallele [50]), can all be seen as particular cases of asynchronous iterations.

For example, the point-Jacobi method defined on the operator $F$ with the initial
approximation $x(0)$ can be represented by the asynchronous iteration
$(F, x(0), \mathcal{J}, \mathcal{S})$ where $\mathcal{J}$ and $\mathcal{S}$ are defined by:

    $J_j = \{ 1, \ldots, n \}$    for $j = 1, 2, \ldots$,

    $s_i(j) = j - 1$    for $j = 1, 2, \ldots$ and $i = 1, \ldots, n$.

The same point-Jacobi method can equivalently be represented by the asynchronous
iteration $(F, x(0), \mathcal{J}, \mathcal{S})$ where $\mathcal{J}$ and $\mathcal{S}$ are defined by:

    $J_j = \{ 1 + (j - 1 \bmod n) \}$    for $j = 1, 2, \ldots$,

    $s_i(j) = n \lfloor (j-1)/n \rfloor$    for $j = 1, 2, \ldots$ and $i = 1, \ldots, n$.

Although those two representations correspond to the same point-Jacobi method,
they differ by the implicit information they contain about the decomposition of the
computations. In the first case, all components are evaluated at once and this, presumably,
will be done by one computational process. In the second case, however, each component
is evaluated separately, and up to $n$ processes can be used to perform the evaluations.
Between the two extreme representations of the point-Jacobi method, in terms of
asynchronous iterations, several others can be proposed, each of which can be interpreted
in terms of decomposition into computational processes and in terms of implementation by
concurrent processes.
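The equivalence of the two representations can be checked numerically. In the sketch below, the helper `run`, the 0-based indexing and the test operator `f` are assumptions made for illustration. It runs both choices of $\mathcal{J}$ and $\mathcal{S}$ and compares iterate $k$ of the first representation with iterate $nk$ of the second.

```python
import math

def run(f, x0, J, s, steps):
    """Iterates x(0..steps) of the asynchronous iteration defined by J and s
    (components are 0-based here)."""
    n = len(x0)
    hist = [list(x0)]
    for j in range(1, steps + 1):
        z = [hist[s(i, j)][i] for i in range(n)]   # x_i read from iterate s_i(j)
        x = list(hist[j - 1])
        for i in J(j):
            x[i] = f(i, z)
        hist.append(x)
    return hist

n = 3

# Illustrative operator (an assumption): f_i(x) = 0.5 cos(x_{i+1 mod n}) + 0.1 i
def f(i, z):
    return 0.5 * math.cos(z[(i + 1) % n]) + 0.1 * i

x0 = [0.0] * n
sweeps = 5

# Representation 1: J_j = {1..n}, s_i(j) = j - 1 (one step per Jacobi sweep).
rep1 = run(f, x0, lambda j: range(n), lambda i, j: j - 1, sweeps)

# Representation 2: one component per step, all reads from iterate n*floor((j-1)/n).
rep2 = run(f, x0, lambda j: [(j - 1) % n], lambda i, j: n * ((j - 1) // n), sweeps * n)

same = all(rep1[k] == rep2[n * k] for k in range(sweeps + 1))
print(same)
```

The intermediate iterates of the second representation (those whose index is not a multiple of $n$) have no counterpart in the first; they are the partially updated vectors that up to $n$ concurrent processes would produce.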
The iterative method proposed by Robert, Charnay and Musy (iteration chaotique
serie-parallele [50]) can be obtained as a special case of an asynchronous iteration in
which $s_i(j) = j - 1$ (for all $i = 1, \ldots, n$ and $j = 1, 2, \ldots$). This corresponds to a
strictly sequential computation of sets of components. The choice of the components within
a set is arbitrary and the calculations of their values can be done simultaneously but the
evaluation of a new set of components cannot be started before all components of the
previous set have been computed and their new values relaxed. The goal of their
research was to show that, for example, in the iterative solution of linear systems
resulting from the application of the method of finite differences to partial differential
equations, it is possible to concentrate the computations more on those points of the grid
where the convergence is slower than on other nodes. This is not the case with ordinary
iterative methods for which any component is iterated as many times as any other
component.

Chazan and Miranker [11] have proposed a chaotic relaxation scheme to solve a
linear system. As we have already mentioned, our definition of an asynchronous iterative
method is similar to the definition they give for a chaotic iterative scheme. Our definition,
however, does not require the condition that $j - s_i(j)$ has to be uniformly bounded by some
fixed integer, say $s$ (for all $i = 1, \ldots, n$ and $j = 1, 2, \ldots$). This assumption,
however, happens to be satisfied in most usual implementations, with small values for $s$.
It will be useful in Sections 6 and 7, and we will use this assumption explicitly in order to
derive bounds on the rate of convergence and on the efficiency of various methods
implemented on an asynchronous multiprocessor.

Although all chaotic relaxation methods (as presented in [11], [41] and [43]) can be
identified as asynchronous iterations, the converse is not true as is illustrated by the
following example. Let $F$ be an operator from $\mathbb{R}^2$ into itself. Assume we have two
processes $P_1$ and $P_2$ attached to the evaluations of the first and second components,
respectively. To avoid synchronization, the processes always use in an evaluation the
values of the components currently available at the beginning of the computation. If we
assume that it always takes 1 unit of time for $P_1$ to perform the evaluation of $x_1$ and it
takes $k$ units of time for $P_2$ to perform the $k$-th evaluation of $x_2$, the quantity
$j - s_1(j)$ grows as $\sqrt{j}$, which is unbounded. This iteration is a legitimate
asynchronous iteration; it is not, however, allowed in the setting of [11], [41] and [43].
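The growth of the delay in this example can be observed with a small discrete-time simulation. The sketch below rests on several assumptions that are not in the text: time advances in integer units, $P_1$ is served before $P_2$ when both finish at the same instant, and the operator (`f1`, `f2`) is an arbitrary contracting example.

```python
def two_process_run(f1, f2, x0, horizon):
    """Simulate P1 (1 time unit per evaluation of x1) and P2 (k units for its
    k-th evaluation of x2); return the final x and the delays j - s_1(j)
    observed at the updates performed by P2."""
    x = list(x0)
    j = 0                                   # index of the latest iterate x(j)
    snap1 = list(x)                         # values read by P1 when it started
    snap2, start2_index = list(x), 0        # values read by P2, and the iterate index then
    k, finish2 = 1, 1                       # P2's k-th evaluation ends at time k(k+1)/2
    delays = []
    for t in range(1, horizon + 1):
        # P1 completes the evaluation it started at time t - 1, then restarts.
        j += 1
        x[0] = f1(snap1)
        snap1 = list(x)
        if t == finish2:
            # P2 completes its k-th evaluation, started k time units earlier.
            j += 1
            x[1] = f2(snap2)
            delays.append(j - start2_index)
            snap2, start2_index = list(x), j
            k += 1
            finish2 += k
    return x, delays

# Illustrative contracting operator on R^2 (an assumption, not from the text).
f1 = lambda z: 0.5 * z[1] + 1.0
f2 = lambda z: 0.25 * z[0] + 1.0

x, delays = two_process_run(f1, f2, [0.0, 0.0], horizon=5000)
print(delays[:5], delays[-1], x)
```

Under these assumptions the $k$-th update by $P_2$ reads a value of $x_1$ that is $k + 1$ iterates old, while its own iterate index is about $k^2/2$, which matches the $\sqrt{j}$ growth; the iterates nevertheless approach the fixed point of the example operator.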

3 - Contracting  operators 

In the next section we shall give a sufficient condition on the operator $F$ for the
convergence of any asynchronous iteration. Needed definitions are given in this section.

3.1 - Lipschitzian and contracting operators

Contracting operators, to be defined below, correspond to P-contractions
in  [46,  p.  433].  They  seem  to  have  been  first  introduced  by  Kantorovitch,  Vulich  and 
Pinsker  in  [31],  and  they  have  been  further  studied  by  Robert  [49].  The  notion  was  used 
in  particular  to  obtain  the  results  of  [10],  [41],  [43]  and  [50]. 

Definition 3.1:

An operator $F$ from $\mathbb{R}^n$ to $\mathbb{R}^n$ is a Lipschitzian operator on a subset
$D$ of $\mathbb{R}^n$ if there exists a non-negative $n \times n$ matrix $A$ such that:

    $|F(x) - F(y)| \le A\,|x - y|$,    $\forall x, y \in D$,    (3.1)

where, if $z$ is a vector of $\mathbb{R}^n$ with components $z_i$, $i = 1, \ldots, n$, $|z|$
denotes the vector with components $|z_i|$, $i = 1, \ldots, n$, and the inequality holds for
every component.

The matrix $A$ will be called a Lipschitzian matrix for the operator $F$. ■

From this definition we can see that any Lipschitzian operator is continuous and, in
fact, uniformly continuous on $D$. However, this definition is too broad and, in particular,
we are not guaranteed of the existence and of the uniqueness of a fixed point as is shown
by the following example. Take the operator $F$ from $\mathbb{R}$ to $\mathbb{R}$ defined by
$F(x) = \sqrt{x^2 + a^2}$; this operator is Lipschitzian on $\mathbb{R}$ because

    $|F(x) - F(y)| = |x - y|\,|x + y| \,/\, \bigl(\sqrt{x^2 + a^2} + \sqrt{y^2 + a^2}\bigr) \le |x - y|$,    $\forall x, y \in \mathbb{R}$.

However, the equation $x = \sqrt{x^2 + 1}$ (corresponding to $a = 1$) has no solution. On the
other hand, the equation $x = |x|$ (corresponding to $a = 0$) has an infinity of solutions,
and, in fact, a continuum of solutions.
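A quick numerical experiment illustrates both failures. This is only a sketch; the starting point and the number of steps are arbitrary choices. With $a = 1$ the successive approximations drift off to infinity, and with $a = 0$ every non-negative number is left unchanged.

```python
import math

def F(x, a=1.0):
    """F(x) = sqrt(x^2 + a^2), the Lipschitzian operator of the example."""
    return math.sqrt(x * x + a * a)

# Successive approximations from x = 0 with a = 1: after k steps, x equals sqrt(k).
x = 0.0
for _ in range(100):
    x = F(x)

# F(x) - x = 1 / (sqrt(x^2 + 1) + x) is positive for every x: no fixed point exists.
gap = F(x) - x

# With a = 0 the operator is x -> |x|, so any non-negative x is a fixed point.
fixed = F(3.5, a=0.0)
print(x, gap, fixed)
```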

We  will,  therefore,  restrict  ourselves  to  the  following  class  of  operators. 

Definition 3.2:

An operator $F$ from $\mathbb{R}^n$ to $\mathbb{R}^n$ is a contracting operator on a subset
$D$ of $\mathbb{R}^n$ if it is a Lipschitzian operator on $D$ with a Lipschitzian matrix $A$
such that $\rho(A) < 1$ (where $\rho(A)$ is the spectral radius of $A$).

The matrix $A$ will be called a contracting matrix for the operator $F$. ■

The fact that, unlike Lipschitzian operators, contracting operators are guaranteed to
have at most one fixed point in the subset $D$ can be easily derived from the definition. In
addition, if we assume, for example, that $D$ is closed and that $F(D)$ is a subset of $D$, we
are also guaranteed of the existence of a fixed point in the subset $D$. A proof can be found
in [46, pp. 433-434].

3.2  - Examples  of  contracting  operators 


Let $F$ be a linear operator given by $F(x) = Ax + b$, where $A$ is an $n \times n$ matrix and
$b$ is a vector of $\mathbb{R}^n$. We observe that $F$ is a contracting operator if and only if
$\rho(|A|) < 1$. Therefore, in the case of linear operators, the notion of contracting
operators coincides with the property stated by Chazan and Miranker for their convergence
result [11], and their result will appear as a particular case of the theorem of the next
section.

We could have considered a more general definition for asynchronous iterative
methods by introducing a relaxation factor $\omega > 0$. This would simply consist of
replacing, in equations (2.1), the operator $F$ by the operator
$F_\omega = \omega F + (1 - \omega) E$, where $E$ is the identity operator of $\mathbb{R}^n$.
It follows that

    $|F_\omega(x) - F_\omega(y)| \le \omega\,|F(x) - F(y)| + |1 - \omega|\,|x - y|$,

and, if $F$ is a contracting operator with a contracting matrix $A$, $F_\omega$ is a
Lipschitzian operator with the Lipschitzian matrix $A_\omega = \omega A + |1 - \omega| I$. The
matrix $A$ being non-negative we have $\rho(A_\omega) = \omega \rho(A) + |1 - \omega|$, and,
if we choose

    $0 < \omega < 2 / [1 + \rho(A)]$,    (3.2)

$F_\omega$ is also a contracting operator. In particular, as long as condition (3.2) is
satisfied, the results of the next section also apply to asynchronous iterative methods with
relaxation. Condition (3.2) is classical and is mentioned, in particular, in [11, p. 221],
[43, p. 62], and [50, p. 31].
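To see why (3.2) is exactly the range in which $\rho(A_\omega) < 1$, one can split on the sign of $1 - \omega$. The short derivation below uses only the identity $\rho(A_\omega) = \omega \rho(A) + |1 - \omega|$ and the assumption $\rho(A) < 1$.

```latex
\begin{align*}
0 < \omega \le 1 &: \quad \rho(A_\omega) = \omega\rho(A) + 1 - \omega
    = 1 - \omega\,[1 - \rho(A)] < 1, \\
\omega > 1 &: \quad \rho(A_\omega) = \omega\rho(A) + \omega - 1 < 1
    \iff \omega < \frac{2}{1 + \rho(A)} .
\end{align*}
```

Since $\rho(A) < 1$ implies $2 / [1 + \rho(A)] > 1$, the two cases together give the interval of condition (3.2).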

If we consider a linear system of equations derived from a linear elliptic differential
equation by the method of finite differences, we note that the system is represented by
$Ax = b$, where $b$ is a vector of $\mathbb{R}^n$ obtained from the boundary conditions and
$A$ is an $n \times n$ M-matrix (see, for example, [62, p. 85]). Therefore the system can be
written into the form of equation (1.1) in which $F$ is the contracting operator given by
$F(x) = (I - D^{-1}A)x + D^{-1}b$, where $D$ is the matrix composed of the diagonal elements
of $A$. This example shows, in the case of linear operators, the importance of contracting
operators.

On  the  other  hand,  non-linear  contracting  operators,  too,  constitute  a very  important 
class.  A first  example  is  directly  derived  from  the  previous  one.  Elliptic  partial 
differential  equations,  obtained  by  the  addition  of  a small  non-linear  perturbation  to  a 
linear partial differential equation, can also be shown to give rise to (non-linear)
contracting  operators. 

More important, if $G$ is a non-linear operator from $\mathbb{R}^n$ into itself with the
simple root $\xi$, superlinear iterative methods have been devised to find the root of $G$,
provided that an initial approximation $x(0)$ sufficiently close to $\xi$ is already known.
For example, the Newton iterative method generates the sequence of iterates

    $x(i+1) = F(x(i)) = x(i) - [G'(x(i))]^{-1} G(x(i))$,    for $i = 0, 1, \ldots$,

which converges quadratically to the root $\xi$ of $G$. In this particular example, we can
easily derive, under usual assumptions (for example, $G'$ satisfies some Lipschitz condition
in a neighborhood of $\xi$), that the Newton operator $F$ corresponding to $G$ is a
contracting operator. (This result will be derived in a more general context in Section 8.)

In fact this result is very general. Let $F$ be an operator from $\mathbb{R}^n$ into itself
with a fixed point $\xi$. If we assume that $F$ is continuously differentiable in the set
$D_r = \{ x \mid \|x - \xi\| < r \}$ and that the derivative $F'$ vanishes at $\xi$ and
satisfies a Lipschitz condition

    $\|F'(x) - F'(y)\| \le M \|x - y\|$,    $\forall x, y \in D_r$,

then it can be easily shown that

    $\|F(x) - F(y)\| \le 2Mr\,\|x - y\|$,    $\forall x, y \in D_r$.

Therefore, by choosing the vector norm $\|x\| = |x_1| + \cdots + |x_n|$ (which only changes
the constant $M$), the operator $F$ is certainly a Lipschitzian operator with the Lipschitzian
matrix $A = [a_{ij}]$ where $a_{ij} = 2Mr$, for $i, j = 1, \ldots, n$. In particular, if we know
a sufficiently close approximation to the fixed point (i.e., if $r$ is small enough), the
operator $F$ is also a contracting operator. This shows that the class of contracting
operators contains, under weak conditions, all iterative functions occurring in the classical
superlinear iterative methods.

4 - Convergence  theorem 

Before stating a sufficient condition ensuring the convergence of an asynchronous
iteration, we give a characterization of a non-negative matrix with spectral radius less
than unity. The result is classical and an algebraic proof of this characterization can be
found in [11, p. 218]. A shorter proof, based on the continuity of the spectral radius of a
matrix as a function of its coefficients, is given below.

Lemma 4.1:

Let $A$ be a non-negative square matrix. Then $\rho(A) < 1$ if and only if there exist
a positive scalar $\omega$ and a positive vector $v$ such that:

    $Av \le \omega v$ and $\omega < 1$.    (4.1)

Proof:

We first assume that (4.1) holds. In this case we note that $\|A\|_v \le \omega < 1$, where
the matrix norm $\|\cdot\|_v$ is induced by the vector norm defined by:

    $\|x\|_v = \max\{ |x_i| / v_i \mid i = 1, \ldots, n \}$.

Therefore the matrix $A$ is convergent, which implies $\rho(A) < 1$ (see, for example,
[62, p. 13]).

Now assume that $\rho(A) < 1$. Let $t$ be a non-negative scalar and $A_t$ be the matrix
obtained by adding $t$ to all null coefficients of $A$. Clearly, for any positive vector $x$, we
have $Ax \le A_t x$. On the other hand, $\rho(A_t)$ is a continuous function of $t$. In
particular, since $A_0 = A$ and $\rho(A) < 1$, we can always choose $t > 0$ small enough so
that $\rho(A_t) < 1$ (in fact, we also have $\rho(A) \le \rho(A_t)$). Then let
$\omega = \rho(A_t)$. As $A_t > 0$, from Perron's theorem (see, for example, [62, p. 30]),
there exists a positive eigenvector $v$ corresponding to the eigenvalue $\omega$. The positive
scalar $\omega$ and the positive vector $v$ verify $Av \le A_t v = \omega v$ with
$\omega < 1$. And this completes the proof. ■

This proof shows, in particular, that $\omega \ge \rho(A)$. But, we also see easily that the
positive scalar $\omega$ can be chosen arbitrarily close to $\rho(A)$.
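The construction used in the second half of the proof can be imitated numerically. The sketch below is an illustration under stated assumptions rather than part of the proof: it approximates the Perron vector of $A_t$ by plain power iteration (a stand-in for Perron's theorem), and the function name, the value of `t` and the test matrix are all hypothetical choices.

```python
def positive_vector_certificate(A, t=1e-3, iters=2000):
    """Follow the proof of Lemma 4.1: build A_t by adding t to the zero
    entries of A, approximate its Perron vector v by power iteration, and
    return (omega, v) with omega = max_i (A v)_i / v_i."""
    n = len(A)
    At = [[a if a != 0 else t for a in row] for row in A]
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(At[i][k] * v[k] for k in range(n)) for i in range(n)]
        top = max(w)
        v = [wi / top for wi in w]          # normalise; v stays strictly positive
    Av = [sum(A[i][k] * v[k] for k in range(n)) for i in range(n)]
    omega = max(Av[i] / v[i] for i in range(n))
    return omega, v

# Illustrative non-negative matrix with rho(A) = 0.5 (an assumption). Its own
# eigenvector for 0.5 is (1, 0), which is not positive, hence the detour via A_t.
A = [[0.5, 0.4],
     [0.0, 0.3]]
omega, v = positive_vector_certificate(A)
Av = [sum(A[i][k] * v[k] for k in range(2)) for i in range(2)]
print(omega, v)
```

By construction $Av \le \omega v$ holds componentwise for the returned pair, so any run that yields $\omega < 1$ with a positive $v$ is a certificate in the sense of (4.1); taking `t` smaller moves $\omega$ closer to $\rho(A)$.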

We are now able to state a sufficient condition on the operator $F$ for the
convergence of any asynchronous iteration corresponding to $F$. Similar results were first
established for chaotic iterations, i.e., under condition (b'), by Chazan and Miranker [11]
in the case of linear operators, and by Miellou [41] and [43] in the case of contracting
operators. The proof given here follows the same idea as in [11, pp. 217-218].

Theorem 4.1:

If $F$ is a contracting operator on a closed subset $D$ of $\mathbb{R}^n$ and if $F(D)$ is a
subset of $D$, then any asynchronous iteration $(F, x(0), \mathcal{J}, \mathcal{S})$
corresponding to $F$ and starting with a vector $x(0)$ in $D$ converges to the unique fixed
point of $F$ in $D$.
Proof:

Let $\xi$ be the unique fixed point of $F$. By considering the operator
$F(x + \xi) - \xi$, we may assume, without loss of generality, that $\xi = F(\xi) = 0$. By
setting $y = \xi$ in equation (3.1), the Lipschitz condition on the operator $F$ gives:

    $|F(x)| \le A\,|x|$,    $\forall x \in D$.

Let $A$ be a contracting matrix for $F$ and let $\omega$ and $v$ be as defined in Lemma 4.1.
Since $v$ is a positive vector, for any starting vector $x(0)$ we can find a positive scalar
$\alpha$ such that $|x(0)| \le \alpha v$.

We will show that we can construct a sequence of indices $j_p$, $p = 0, 1, \ldots$, such that
the sequence of iterates of $(F, x(0), \mathcal{J}, \mathcal{S})$ satisfies:

    $|x(j)| \le \alpha \omega^p v$    for $j \ge j_p$.    (4.2)

As $0 < \omega < 1$, this shows that $x(j) \to 0$ as $j \to \infty$ and this will prove the
theorem.

We first show that inequality (4.2) holds for $p = 0$ if we choose $j_0 = 0$. That is, for
$j \ge 0$ we have:

    $|x(j)| \le \alpha v$.    (4.3)

From the choice of $\alpha$, inequality (4.3) is true for $j = 0$. Assume, for induction, that
it is true for $0 \le j < k$ and consider $x(k)$. Let $z$ denote the vector with components
$z_i = x_i(s_i(k))$, for $i = 1, \ldots, n$. From Definition 2.1, the components of $x(k)$ are
given either by $x_i(k) = x_i(k-1)$ if $i \notin J_k$, in which case
$|x_i(k)| = |x_i(k-1)| \le \alpha v_i$, or by $x_i(k) = f_i(z)$ if $i \in J_k$. In this latter
case, we note that, as $s_i(k) < k$ (condition (a) of Definition 2.1), we have:

    $|F(z)| \le A\,|z| \le \alpha A v \le \alpha \omega v$,

and in particular:

    $|x_i(k)| = |f_i(z)| \le \alpha \omega v_i$.

As $0 < \omega < 1$, in this case too we obtain $|x_i(k)| \le \alpha v_i$ and (4.3) is proved
by induction, which shows that (4.2) is true for $p = 0$ if we choose $j_0 = 0$.
Now assume that $j_{q-1}$ has been found and that inequality (4.2) holds for
$0 \le p < q$. We want to find $j_q$ and show that (4.2) also holds for $p = q$.

First define $r$ by

    $r = \min\{ k \mid \forall j \ge k,\ s_i(j) \ge j_{q-1}, \text{ for } i = 1, \ldots, n \}$.

We see, from condition (b) of Definition 2.1, that this number exists, and we note that, from
condition (a), we have $r > j_{q-1}$ which shows, in particular, that
$|x(r)| \le \alpha \omega^{q-1} v$.

Then take $j \ge r$ and consider the components of $x(j)$. As above, let $z$ be the vector
with components $z_i = x_i(s_i(j))$. From the choice of $r$, we have $s_i(j) \ge j_{q-1}$, for
$i = 1, \ldots, n$, and this shows that $|z| \le \alpha \omega^{q-1} v$. In particular, using
the contracting property of the operator $F$ we obtain:

    $|F(z)| \le A\,|z| \le \alpha \omega^{q-1} A v \le \alpha \omega^q v$.

This inequality shows that, if $i \in J_j$, $x_i(j)$ satisfies:

    $|x_i(j)| \le \alpha \omega^q v_i$.

On the other hand, if $i \notin J_j$ the $i$-th component is not modified. Therefore, as soon
as the $i$-th component is updated between the $r$-th and the $j$-th iteration we have:

    $|x_i(j)| \le \alpha \omega^q v_i$.    (4.4)

Now, define $j_q$ as:

    $j_q = \min\{ j \mid j \ge r \text{ and } \{1, \ldots, n\} = J_r \cup \cdots \cup J_j \}$

(this number exists by condition (c) of Definition 2.1), then for any $j \ge j_q$ every
component is updated at least once between the $r$-th and the $j$-th iteration and therefore
inequality (4.4) holds for $i = 1, \ldots, n$. This shows that inequality (4.2) holds for
$p = q$ and this proves the theorem. ■

Considering only the class of linear operators, $F(x) = Ax + b$, Chazan and
Miranker [11] have established a stronger result, namely, that the condition
$\rho(|A|) < 1$ is also a necessary condition for the convergence of chaotic iterations.
5 - The class of asynchronous iterative methods with memory

The idea behind the definition of asynchronous iterations, as presented in Section 2,
is to allow, in the evaluation of $F(x)$, different (and independent) processes to compute
different subsets of the components. This corresponds to a natural decomposition for the
evaluation of $F(x)$ when the operator $F$ is known explicitly by the set of functions
$f_1, \ldots, f_n$. This is not, however, always so. For example, if $F$ is the Newton operator
corresponding to a non-linear operator $G$, i.e., $F(x) = x - [G'(x)]^{-1} G(x)$, usually only
the operator $G$ is given and the operator $F$ is not known explicitly. In this particular
case, when two processors are available, a more natural decomposition, as proposed by Kung
in [37], is to have one process computing the value of $G$ while the other process uses this
value for the evaluation of $F$. More precisely, if $x$ and $y$ are two global variables
containing the current values of the iterate and of the reciprocal of the derivative of $G$,
respectively, the two processes correspond to the two following programs.

Process 1:  while (termination criterion not satisfied)
                do x := x - y × G(x).

Process 2:  while (termination criterion not satisfied)
                do y := [G'(x)]^{-1}.

Starting with the initial values $x(0)$ and $[G'(x(0))]^{-1}$ for $x$ and $y$ respectively,
the two processes execute their programs asynchronously and use for $x$ and $y$ whatever
values are currently available when needed. They implicitly define the sequence of
iterates $x(j)$, for $j = 0, 1, \ldots$, through formulas of the form:

    $x(j) = H[x(j-1), x(k_j)]$,    with $k_j \le j - 1$,    (5.1)

where

    $H(x, y) = x - [G'(y)]^{-1} G(x)$.

This iteration, however, is not allowed in the setting of Definition 2.1, because, in
equation (5.1), $x(j)$ is defined in terms of two previous iterates. This motivates the need
for a generalization of the class of asynchronous iterative methods.
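Equation (5.1) can be tried out in a deterministic, single-threaded form. The sketch below is not Kung's two-process program: the scalar equation $G(x) = x^2 - 2$, the starting point, and the hypothetical lag pattern (the derivative is refreshed only every third step) are assumptions chosen so that $k_j \le j - 1$ and $k_j \to \infty$.

```python
import math

def G(x):
    return x * x - 2.0          # illustrative scalar equation G(x) = 0, root sqrt(2)

def G_prime(x):
    return 2.0 * x

def H(x, y):
    """H(x, y) = x - [G'(y)]^{-1} G(x), as in equation (5.1)."""
    return x - G(x) / G_prime(y)

def stale_newton(x0, steps, lag):
    """Iterate x(j) = H(x(j-1), x(k_j)) with k_j = max(0, j - 1 - lag(j))."""
    xs = [x0]
    for j in range(1, steps + 1):
        k_j = max(0, j - 1 - lag(j))       # k_j <= j - 1, and k_j -> infinity
        xs.append(H(xs[j - 1], xs[k_j]))
    return xs

# Hypothetical lag pattern: the derivative is taken at x(0), x(3), x(6), ...
xs = stale_newton(x0=1.5, steps=30, lag=lambda j: (j - 1) % 3)
error = abs(xs[-1] - math.sqrt(2.0))
print(xs[-1], error)
```

With `lag` identically zero the loop reduces to the ordinary Newton iteration; a non-zero lag makes each iterate depend on two earlier iterates, which is exactly what Definition 2.1 does not cover.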



5.1 - Asynchronous iterations with memory

A generalization of Definition 2.1 can be obtained by noting that, if, for
$j = 2, 3, \ldots$, it happens that $k_j = j - 2$ in equation (5.1), this equation defines a
sequence of iterates which corresponds exactly to the sequence generated by an iterative
method with one memory. This remark suggests the following generalization for the problem
stated in equation (1.1).

Given an operator $F$ from $[\mathbb{R}^n]^m$ into $\mathbb{R}^n$, the problem is now to find
a vector $\xi$ in $\mathbb{R}^n$ such that:

    $\xi = F(\xi, \ldots, \xi)$.    (5.2)

The vector $\xi$ will still be called a fixed point for the operator $F$.

In very much the same way as we introduced the class of asynchronous iterative
methods to solve equation (1.1), we now introduce the class of asynchronous iterative
methods with memory to solve equation (5.2).


Definition 5.1:

Let $F$ be an operator from $[\mathbb{R}^n]^m$ into $\mathbb{R}^n$. An asynchronous iteration with memory corresponding to the operator $F$ and starting with a given set of vectors $x(0), \ldots, x(m-1)$ is a sequence $x(j)$, $j = 0, 1, \ldots$, of vectors of $\mathbb{R}^n$ defined for $j = m, m+1, \ldots$, by:

$$x_i(j) = \begin{cases} x_i(j-1) & \text{if } i \notin J_j \\ F_i(z^1, \ldots, z^m) & \text{if } i \in J_j, \end{cases}$$

where $z^r$, $1 \le r \le m$, is the vector with components $z^r_i = x_i(s^r_i(j))$, $1 \le i \le n$. As in Definition 2.1, $\mathcal{J} = \{ J_j \mid j = m, m+1, \ldots \}$ is a sequence of non-empty subsets of $\{1, \ldots, n\}$ which correspond to the subsets of components evaluated at each step of the iteration. But the sequence $\mathcal{S}$ is now to be replaced by:

$$\mathcal{S} = \{ (s^1_1(j), \ldots, s^1_n(j), \ldots, s^m_1(j), \ldots, s^m_n(j)) \mid j = m, m+1, \ldots \},$$

a sequence of elements in $[\mathbb{N}^n]^m$. In addition, while condition (c) of Definition 2.1 remains the same, conditions (a) and (b) now become: for each $i = 1, \ldots, n$,

(a) $\max\{ s^r_i(j) \mid 1 \le r \le m \} \le j-1$, for $j = m, m+1, \ldots$;

(b) $\min\{ s^r_i(j) \mid 1 \le r \le m \}$ tends to infinity as $j$ tends to infinity.

An asynchronous iteration with memory corresponding to $F$, starting with a set $X$ of $m$ vectors and defined with $\mathcal{J}$ and $\mathcal{S}$ will be denoted by $(F, X, \mathcal{J}, \mathcal{S})$. ■
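The definition can be read as a recipe for generating the sequence once the subsets $J_j$ and the indices $s^r_i(j)$ are known. The sketch below is a sequential simulation under assumptions that are not in the text: components are numbered from 0, the callables `subset` and `delay` stand for $J_j$ and $s^r_i(j)$, and the small two-component operator `F` is a made-up example whose fixed point is $(2, 2)$.

```python
def async_iteration_with_memory(F, start, subset, delay, steps):
    """Generate x(m), ..., x(m + steps - 1) following the scheme of Definition 5.1.

    F(z)            : maps a list z of m vectors to a vector of length n
    start           : the m starting vectors x(0), ..., x(m-1)
    subset(j)       : component indices in J_j (0-based here)
    delay(r, i, j)  : the index s_i^r(j), with r = 0, ..., m-1; must be <= j - 1
    """
    m, n = len(start), len(start[0])
    xs = [list(v) for v in start]
    for j in range(m, m + steps):
        # z^r has components x_i(s_i^r(j))
        z = [[xs[delay(r, i, j)][i] for i in range(n)] for r in range(m)]
        value = F(z)
        x_new = list(xs[j - 1])          # components outside J_j are kept
        for i in subset(j):
            x_new[i] = value[i]
        xs.append(x_new)
    return xs


def F(z):
    """Hypothetical operator with m = 2, n = 2 and fixed point (2, 2)."""
    z1, z2 = z
    return [0.3 * z1[(i + 1) % 2] + 0.2 * z2[i] + 1.0 for i in range(2)]


xs = async_iteration_with_memory(
    F,
    start=[[0.0, 0.0], [5.0, -3.0]],
    subset=lambda j: [j % 2],                                  # one component per step
    delay=lambda r, i, j: j - 1 if r == 0 else max(0, j - 3),  # satisfies (a) and (b)
    steps=200,
)
```

The chosen `delay` keeps the largest index at $j-1$ and lets the smallest one grow with $j$, so conditions (a) and (b) hold for this example.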

For practical reasons (e.g., stability in the implementation on a computer), we might want to have the additional condition that the vectors $z^1, \ldots, z^m$ are all distinct. But this restriction is not essential for our purpose here if we assume, for example, that the operator $F$ is defined by continuity when two or more vectors are identical. This will be the case with the class of operators we will consider.

In order to obtain, for asynchronous iterations with memory, a convergence result similar to the result stated in Theorem 4.1, we need to generalize the notion of contracting operators to operators from $[\mathbb{R}^n]^m$ into $\mathbb{R}^n$.

In the remainder of the section, we will use the following notation. If $\{x^1, \ldots, x^m\}$ is a set of vectors in $\mathbb{R}^n$, $z = \max\{x^1, \ldots, x^m\}$ denotes the vector in $\mathbb{R}^n$ with components $z_i = \max\{ x^r_i \mid 1 \le r \le m \}$, $i = 1, \ldots, n$. A natural generalization of the notion of contracting operators is given in the following.

Definition 5.2:

An operator $F$ from $[\mathbb{R}^n]^m$ into $\mathbb{R}^n$ is an m-contracting operator on a subset $D$ of $\mathbb{R}^n$ if there exists a non-negative $n \times n$ matrix $A$ with spectral radius less than unity satisfying, for all $x^1, \ldots, x^m, y^1, \ldots, y^m$ in $D$,

$$|F(x^1, \ldots, x^m) - F(y^1, \ldots, y^m)| \le A \max\{ |x^1 - y^1|, \ldots, |x^m - y^m| \}.$$

The matrix $A$ will be called a contracting matrix for the operator $F$.

When $m = 1$, the preceding definition corresponds exactly to Definition 3.2, and m-contracting operators have all the properties we have already mentioned for contracting operators. In particular, it is clear from the definition that m-contracting operators are continuous and, in fact, uniformly continuous on $D^m$. The uniqueness of a fixed point in $D$ is also easily derived. In addition, if we assume that $D$ is a closed subset of $\mathbb{R}^n$ such that $F(D^m)$ is a subset of $D$, then we are guaranteed the existence of a fixed point in $D$: the fixed point is, for example, obtained as the limit of the sequence $x(j)$, $j = 0, 1, \ldots$, defined by:

$$x(j) = F(x(j-1), \ldots, x(j-m)), \quad j = m, m+1, \ldots,$$

which is independent of the set of starting vectors $x(0), \ldots, x(m-1)$ in $D$.

We are now able to state the analogue of Theorem 4.1 for m-contracting operators in the following.

Theorem 5.1:

If $F$ is an m-contracting operator on a closed subset $D$ of $\mathbb{R}^n$ such that $F(D^m)$ is a subset of $D$, then any asynchronous iteration with memory corresponding to the operator $F$ and starting with an arbitrary set of $m$ vectors in $D$ converges to the unique fixed point of $F$ in $D$.

Proof: 

With  slight  modifications,  the  proof  of  this  theorem  is  identical  to  the  proof  of 
Theorem  4.1.  ■ 

5.2  - Examples  of  asynchronous  iterations  with  memory 

In the beginning of this section, we considered the Asynchronous Newton's method to find the simple root $\xi$ of a non-linear operator $G$. This method led to the sequence of iterates generated by the asynchronous iteration with memory $(H, \{x(0), x(0)\}, \mathcal{J}, \mathcal{S})$, where:

$$J_j = \{1, \ldots, n\} \quad \text{for } j = 2, 3, \ldots,$$

$$s^1_i(j) = j-1, \quad s^2_i(j) = k_j \quad \text{for } j = 2, 3, \ldots \text{ and } i = 1, \ldots, n.$$

In addition, as the operator $H$ can easily be shown to be a 2-contracting operator (assuming, for example, some Lipschitz condition for the derivative of $G$ in a small neighborhood of the root $\xi$), we see that the sequence defined by equation (5.1) converges to $\xi$, provided that $k_j$ tends to infinity with $j$ (which simply states the fact that the processes eventually complete each step of their computations).

Let $F$ be an operator from $[\mathbb{R}^n]^m$ into $\mathbb{R}^n$, and let $\omega$ be a positive scalar. Consider the operator $F_\omega$ from $[\mathbb{R}^n]^{m+1}$ into $\mathbb{R}^n$ obtained from the operator $F$ by the introduction of the relaxation factor $\omega$, and defined as

$$F_\omega(x^0, x^1, \ldots, x^m) = (1-\omega)x^0 + \omega F(x^1, \ldots, x^m).$$

We first note that both $F$ and $F_\omega$ have the same fixed points (if any). We also note that, if $F$ is an m-contracting operator on some subset $D$ of $\mathbb{R}^n$ with the contracting matrix $A$, then, for all $x^0, \ldots, x^m, y^0, \ldots, y^m$ in $D$, the operator $F_\omega$ satisfies:

$$|F_\omega(x^0, \ldots, x^m) - F_\omega(y^0, \ldots, y^m)| \le |1-\omega||x^0 - y^0| + \omega|F(x^1, \ldots, x^m) - F(y^1, \ldots, y^m)|$$

$$\le |1-\omega||x^0 - y^0| + \omega A \max\{ |x^1 - y^1|, \ldots, |x^m - y^m| \}$$

$$\le [\,|1-\omega| I + \omega A\,] \max\{ |x^0 - y^0|, |x^1 - y^1|, \ldots, |x^m - y^m| \},$$

and, provided that $0 < \omega < 2/[1 + \rho(A)]$, $F_\omega$ is an (m+1)-contracting operator on $D$ with the contracting matrix $A_\omega = |1-\omega| I + \omega A$. This reestablishes, in a more general setting, the result mentioned in Section 3.2 for asynchronous iterative methods with relaxation.
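The admissible range of the relaxation factor can be checked numerically. The sketch below relies on a standard fact rather than on anything stated in the text: for a non-negative matrix $A$, the spectral radius $\rho(A)$ is itself an eigenvalue, so the spectral radius of $|1-\omega| I + \omega A$ is $|1-\omega| + \omega\rho(A)$. The function names and the sample value 0.991 are illustrative.

```python
def relaxed_spectral_radius(omega, rho_A):
    """Spectral radius of |1 - omega| I + omega A for a non-negative A with radius rho_A."""
    return abs(1.0 - omega) + omega * rho_A


def largest_admissible_omega(rho_A):
    """Upper end of the open interval 0 < omega < 2 / (1 + rho(A))."""
    return 2.0 / (1.0 + rho_A)


rho_A = 0.991                                   # illustrative value
omega_max = largest_admissible_omega(rho_A)     # about 1.0045
inside = relaxed_spectral_radius(0.999 * omega_max, rho_A)
outside = relaxed_spectral_radius(1.001 * omega_max, rho_A)
```

Just inside the interval the radius stays below 1, and just outside it exceeds 1, which is why the bound on $\omega$ tightens toward 1 as $\rho(A)$ approaches unity.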

In [42], Miellou introduced a generalization of the idea of itérations chaotiques à retards for the problem of finding the fixed point of an operator $F$ from $[\mathbb{R}^n]^m$ into $\mathbb{R}^n$. His
generalization  is  a particular  case  of  an  asynchronous  iteration  with  memory 
corresponding to the operator $F$ (with $m = 2$). Miellou, in addition, gives convergence
results  under  different  assumptions  on  the  operator  F (monotony,  continuity  and  existence 
of  a fixed  point). 

Many more examples of asynchronous iterations with memory can be given and, in particular, all classical iterative methods with memory can be expressed in this way. In addition, all usual super-linear iterative methods with $m$ memories can be shown (under weak conditions) to correspond to some (m+1)-contracting operator, therefore ensuring the convergence of any asynchronous iteration corresponding to this operator.

6 - On the complexity of asynchronous iterations

Let $F$ be an operator from $\mathbb{R}^n$ to itself with a fixed point $\xi$ and satisfying the assumptions of Theorem 4.1. We now investigate some measures of complexity for the convergence of the asynchronous iteration $(F, x(0), \mathcal{J}, \mathcal{S})$ toward the fixed point $\xi$ of $F$.

We will first derive, in Section 6.1, results applicable to asynchronous iterations in general; then, in Section 6.2, using condition (b') in Definition 2.1, we will derive more specific results for the particular case of chaotic iterations.

The constructive proof of the theorem already provides us with bounds for the error vector $x(j) - \xi$. And, in fact, if $F$ is a contracting operator with the contracting matrix $A$, we note that an estimate of the error committed with the asynchronous iteration $(F, x(0), \mathcal{J}, \mathcal{S})$ is directly obtainable from the asynchronous iteration $(A, |x(0) - \xi|, \mathcal{J}, \mathcal{S})$. This estimate is used in this section to derive bounds for the complexity of asynchronous iterations corresponding to contracting operators. However, since $(A, |x(0) - \xi|, \mathcal{J}, \mathcal{S})$ can only reflect linear convergence, this estimate is certainly not adequate to deal with all asynchronous iterations, and, in Section 8, using an example, we present an analysis for an asynchronous iteration with super-linear convergence.

For convenience, we only consider the convergence in norm of the error vector $x(j) - \xi$. By choosing, for example, the norm $\|x\| = \max\{ |x_i| \mid i = 1, \ldots, n \}$, this corresponds to the worst possible case for the convergence of the components.

To measure the linear convergence of the sequence $x(j)$, $j = 0, 1, \ldots$, toward its limit $\xi$, we consider the following complexity measures often referred to in the literature. The rate of convergence of the sequence is defined as:

$$R = \liminf_{j \to \infty} \, [(-\log\|x(j) - \xi\|)/j].$$

In addition, if $c_j$ is the cost associated with the evaluations of the first $j$ iterates, $x(1), \ldots, x(j)$, we define the complexity of the sequence by:

$$E = \liminf_{j \to \infty} \, [(-\log\|x(j) - \xi\|)/c_j].$$

If all logarithms are taken to the base 10, $1/R$ measures the asymptotic number of steps required to divide the error by a factor of 10, whereas $1/E$ measures the corresponding cost. We note that, if $c_j/j$ tends to some finite limit $c$ (which corresponds to the average cost per step), then the complexity is simply given by $E = R/c$.

The costs $c_j$, $j = 1, 2, \ldots$, can be chosen according to any convenient measure. In our case, we consider the cost to correspond either to the number of evaluations of the operator $F$, or to the time to perform the evaluations. In the former case, if each component is equally hard to compute, the cost can be directly evaluated from the sequence $\mathcal{J}$ by considering

$$c_j = (|J_1| + \cdots + |J_j|)/n, \qquad (6.1)$$

where $|J_j|$ is the cardinality of the set $J_j$, i.e., the number of components evaluated at the j-th step of the iteration. In the latter case, the cost is better suited to deal with parallel algorithms, and can be evaluated through the classical tools of queueing theory. When it is necessary to indicate which cost measure is used in the evaluation of the complexity, we use the notations $E_E$ if the cost is measured in number of evaluations of $F$, and $E_T$ if the cost is measured in units of the time needed to perform (sequentially) one evaluation of $F$.
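The two measures can be approximated from a finite record of errors and subset sizes. The sketch below is only illustrative: it replaces the lim inf by the value at the last recorded index, uses base-10 logarithms, and feeds in a synthetic error sequence $0.991^j$ with subsets of size $n/k$; the function names and the values of `n`, `k` and `steps` are assumptions, not data from the text.

```python
import math


def cumulative_cost(subset_sizes, n):
    """c_j = (|J_1| + ... + |J_j|) / n for j = 1, 2, ..., as in equation (6.1)."""
    costs, total = [], 0
    for size in subset_sizes:
        total += size
        costs.append(total / n)
    return costs


def rate_estimate(errors):
    """Finite-j stand-in for R: -log10(error at j) / j, with errors[0] the error at j = 0."""
    j = len(errors) - 1
    return -math.log10(errors[j]) / j


def complexity_estimate(errors, costs):
    """Finite-j stand-in for E: -log10(error at j) / c_j, with costs[j - 1] equal to c_j."""
    j = len(errors) - 1
    return -math.log10(errors[j]) / costs[j - 1]


n, k, steps = 504, 6, 1200                     # illustrative sizes
errors = [0.991 ** j for j in range(steps + 1)]
costs = cumulative_cost([n // k] * steps, n)   # each step evaluates n/k components
R = rate_estimate(errors)
E = complexity_estimate(errors, costs)
```

Here the average cost per step is $c = 1/k$, so the estimates satisfy $E = R/c = kR$, and $1/R$ comes out close to 255 steps per decimal digit of accuracy.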

6.1  - General  bounds:  asynchronous  iterations 

We return to the proof of Theorem 4.1, and we use the same notations. The proof simply consists of constructing an increasing sequence of indices $j_p$, $p = 0, 1, \ldots$, satisfying

$$\|x(j) - \xi\| \le a\,\alpha^p \quad \text{for } j \ge j_p,$$

where the positive constant $a$ can be taken to be $a = \|x(0) - \xi\|$. From the construction of this sequence we note that

$$j_{p+1} = j_p + r_p + t_p,$$

where $r_p$ and $t_p$ are integers chosen to satisfy: (1) starting with the index $j_p + r_p$, all evaluations of iterates do not make any more use of values of components corresponding to iterates with indices smaller than $j_p$, and (2) all components are evaluated at least once between the $(j_p + r_p)$-th and the $(j_p + r_p + t_p)$-th iterates.

Now let

$$p_j = \sup\{\, p \mid r_0 + t_0 + \cdots + r_{p-1} + t_{p-1} \le j \,\} \quad \text{for } j = 0, 1, \ldots \qquad (6.2)$$

Then, if we know $r_p$ and $t_p$ for $p = 0, 1, \ldots$, we can deduce a bound on $\|x(j) - \xi\|$ since

$$\|x(j) - \xi\| \le a\,\alpha^{p_j}, \quad \text{for } j = 0, 1, \ldots,$$

which shows that the sequence $x(j)$, $j = 0, 1, \ldots$, converges at least as fast as the sequence $a\,\alpha^{p_j}$, $j = 0, 1, \ldots$, with a rate of convergence $R$ such that

$$R \ge -\Big[\liminf_{j \to \infty} (p_j/j)\Big] \log\alpha.$$

And, if $c_j$ is the cost associated with the evaluations of the first $j$ iterates, we have the following bound for the complexity:

$$E \ge -\Big[\liminf_{j \to \infty} (p_j/c_j)\Big] \log\alpha.$$

In addition, as was noticed earlier, if $A$ is a contracting matrix for the operator $F$, $\alpha$ can be chosen arbitrarily close to $\rho(A)$. This shows that in the bounds we have just obtained we can simply replace $\alpha$ by $\rho(A)$, and this yields the following.

Theorem 6.1:

Let $F$ satisfy the condition of Theorem 4.1, and let $A$ be a contracting matrix for the operator $F$. Then the asynchronous iteration $(F, x(0), \mathcal{J}, \mathcal{S})$ converges to the fixed point of $F$ with a rate of convergence

$$R \ge -\Big[\liminf_{j \to \infty} (p_j/j)\Big] \log\rho(A),$$

and a complexity

$$E \ge -\Big[\liminf_{j \to \infty} (p_j/c_j)\Big] \log\rho(A),$$

where the sequence $p_j$ is defined from $\mathcal{J}$ and $\mathcal{S}$ by equation (6.2).

An  example 

As an illustration, we consider the parallel implementation of Jacobi's method with $k$ processes. For simplicity, we assume that $n$ is a multiple of $k$, and we set $q = n/k$.

To avoid an overhead in the selection of the components to be updated at each step of the iteration, each process is assigned to the evaluation of a fixed subset of the components. In particular, when all components are equally hard to compute, and when all processors are equally fast, it is natural to decompose the set of components into subsets of equal sizes and, for example, to assign the first process to the evaluation of the first $q$ components, the second process to the evaluation of the next $q$ components, and so forth. Corresponding to this decomposition, a parallel implementation of Jacobi's method with $k$ processes can be represented by the asynchronous iteration $(F, x(0), \mathcal{J}, \mathcal{S})$, where $\mathcal{J}$ and $\mathcal{S}$ are defined by:

$$J_j = \{\, i \mid 1 + ((j-1) \bmod k)\,q \le i \le q + ((j-1) \bmod k)\,q \,\} \quad \text{for } j = 1, 2, \ldots,$$

$$s_i(j) = k \lfloor (j-1)/k \rfloor \quad \text{for } j = 1, 2, \ldots \text{ and } i = 1, \ldots, n.$$

The two asynchronous iterations we introduced in Section 2.2 to represent Jacobi's method correspond to the particular cases $k = 1$ and $k = n$.

It is easy to check that $r_p$ and $t_p$ are given by 1 and $k-1$, respectively, for $p = 0, 1, \ldots$ This shows that $p_j = \lfloor j/k \rfloor$ and therefore

$$R(k) \ge -(\log\rho(A))/k.$$

Now, if $c_j$ measures the number of evaluations of $F$ required to compute the first $j$ iterates, using equation (6.1), we have $c_j = j/k$. This gives for the complexity:

$$E_E(k) \ge -\log\rho(A). \qquad (6.3)$$

For all values of $k$, we obtain the same bound for the complexity. In particular, when $F$ is the linear operator defined by $F(x) = Ax + b$, where $A$ is a non-negative $n \times n$ matrix with spectral radius less than unity, then $A$ can be chosen as a contracting matrix for $F$ and the bound (6.3) is known to be sharp.
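The schedule of this example is easy to enumerate. The sketch below assumes that the delay is $s_i(j) = k \lfloor (j-1)/k \rfloor$, i.e. that every block in a sweep reads the iterate produced at the end of the previous sweep; this reading of the formula, like the function name and the sample sizes, is an assumption made for illustration.

```python
def jacobi_schedule(n, k, j):
    """Return (J_j, s_j) for the k-process Jacobi schedule, 1-based as in the text.

    J_j is the block of q = n/k components evaluated at step j;
    s_j is the (assumed) common delay index k * floor((j - 1) / k).
    """
    q = n // k
    block = (j - 1) % k
    J_j = list(range(1 + block * q, q + block * q + 1))
    s_j = k * ((j - 1) // k)
    return J_j, s_j


n, k = 6, 3                                  # illustrative sizes
schedule = [jacobi_schedule(n, k, j) for j in range(1, 2 * k + 1)]
cost = [sum(len(J) for J, _ in schedule[:j]) / n for j in range(1, 2 * k + 1)]
```

For these sizes the first sweep uses blocks {1, 2}, {3, 4}, {5, 6} with delay index 0, the second sweep repeats the blocks with delay index 3, and the cumulative cost from equation (6.1) is $c_j = j/k$.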

Since the asynchronous iteration we are considering corresponds to a parallel implementation of Jacobi's method, instead of measuring the cost by the number of evaluations of $F$, it is more natural to use the average time to perform the evaluations as a measure of the cost. Let the time unit be the average time to perform (sequentially) one evaluation of $F$. Then, if $pk \le j \le (p+1)k$, we have $c_{pk} \le c_j \le c_{(p+1)k}$ and $c_{pk} = p\lambda_k/k$.

The expression $\lambda_k/k$ corresponds to the time for the $k$ processes to execute in parallel their computations and to synchronize their executions. The factor $\lambda_k$ is the penalty factor introduced by Kung in [37]; it measures the overhead due to the fluctuations in the computing times of the $k$ processes, and can be evaluated if we know, for example, the distribution function for the time to evaluate $F$. In particular, we have $\lambda_1 = 1$ and, for $k \ge 2$, $\lambda_k \ge 1$ with equality only when it always takes the same constant time to evaluate $F$ (i.e., there are no fluctuations in the computing time). This cost measure yields the following bound for the complexity:

$$E_T(k) \ge -[k/\lambda_k] \log\rho(A).$$

Again, these bounds are sharp for the linear operator we mentioned above, and the ratio $E_T(k)/E_T(1) = k/\lambda_k$ measures the speed-up achieved by using a parallel implementation with $k$ processes. We would expect the implementation with $k$ processes to be $k$ times as efficient as the sequential implementation (with $k = 1$), but this is not so because of the overhead introduced by synchronizing the $k$ processes and measured by the penalty factor $\lambda_k$.

6.2  - Additional  assumptions:  chaotic  iterations 

In the preceding example, we have been able to carry out the analysis for Jacobi's method (and even obtain sharp bounds on the complexity) because the representation in terms of asynchronous iterations is known explicitly and follows a very regular pattern. This is not, however, generally so. For example, in a parallel implementation with several processes using no synchronization (as presented in Section 2.1), the sequences $\mathcal{S}$ and $\mathcal{J}$ (and, therefore, the sequences $r_p$ and $t_p$, $p = 0, 1, \ldots$) are not known directly but are only defined implicitly by the processes in the course of their executions.

Below, we present alternate bounds for $R$ and $E$ under conditions often satisfied in usual implementations of asynchronous iterations. We assume that we know bounds on $r_p$ and $t_p$, and we restrict the definition of the class of asynchronous iterative methods by replacing conditions (b) and (c) of Definition 2.1 with the following:

(b') There exists a positive integer $r$ such that, for $j = 1, 2, \ldots$ and $i = 1, \ldots, n$,
$$s_i(j) \ge j - r,$$

(c') there exists a non-negative integer $t$ such that, for $j = 1, 2, \ldots$,
$$J_j \cup \cdots \cup J_{j+t} = \{1, \ldots, n\}.$$

As was already mentioned, condition (b') was proposed by Chazan and Miranker in the definition of the chaotic relaxation scheme [11]. Although the convergence result obtained under condition (b) of Definition 2.1 is mathematically more satisfactory, condition (b') is very often satisfied in practical applications, in particular, when the computations of all components have the same complexity (which is the case with a linear operator). Condition (c') is also satisfied for most of the usual implementations of asynchronous iterations, since it is natural that (1) a process evaluates a component by using the most recently updated values of all components; and (2) two processes never evaluate the same component at the same time; in this case it follows directly that, by taking $r = t+1$, conditions (b') and (c') are equivalent.

Under the additional conditions (b') and (c'), we clearly have $r_p \le r$ and $t_p \le t$, for $p = 0, 1, \ldots$, and, therefore, $p_j \ge \lfloor j/(r+t) \rfloor$. From the bounds stated in Theorem 6.1, we immediately obtain the following.

Corollary:

Let $F$ satisfy the condition of Theorem 4.1, and let $A$ be a contracting matrix for $F$. If the asynchronous iteration $(F, x(0), \mathcal{J}, \mathcal{S})$ satisfies the additional conditions (b') and (c'), then it converges to the fixed point of $F$ with a rate of convergence

$$R \ge -[1/(r+t)] \log\rho(A),$$

and a complexity

$$E \ge -\Big[\lim_{j \to \infty} j/((r+t)c_j)\Big] \log\rho(A).$$

7 - Experimental  results 

The results of this section are reported in detail in Chapter V. A very brief presentation is given below as an immediate illustration of asynchronous iterative methods.

Several asynchronous iterations have been experimented with on C.mmp, the Carnegie-Mellon multiprocessor [63]; they are described in Section 7.1, and the actual measurements are presented in Section 7.2. Although asynchronous iterative methods are applicable to non-linear problems, the experiments reported here deal only with linear problems. More specific treatments for non-linear problems will be reported elsewhere.

7.1  - Experiments  with  asynchronous  iterations 

All  asynchronous  iterations  we  have  experimented  with  consist  of  the  parallel 
execution  of  k processes.  As  we  did  with  the  parallel  implementation  of  Jacobi’s  method, 
we  assign  to  each  of  the  processes  the  evaluation  of  a fixed  subset  of  the  components. 
Each  process  computes  cyclically  new  values  for  the  components  in  its  subset,  and  the 
methods only differ by the choice of the values used in the evaluations.

Asynchronous Jacobi's method (AJ): For the evaluations of all components, a process uses only values of the components known at the beginning of a cycle, and the process releases all new values at the end of each cycle.

Asynchronous Gauss-Seidel's method (AGS): Same as the AJ method except that the process uses new values of the components in its subset as soon as they are known for further evaluations in the same cycle. Again, it releases the new values (for the other processes) at the end of its cycle.

Purely Asynchronous method (PA): A process computes the new values of each component by using the most recent values of all components and releases each new value immediately after its evaluation.

The PA method is certainly the easiest method to implement and, as far as space is concerned, is clearly the most efficient one, whereas the AJ method is the worst one, since it requires from each process not only a complete duplication of all components (as of the beginning of its cycle) but still another copy of the components in its own subset. This can hardly be justified, but experimental results give useful comparisons between the AJ method and the actual Jacobi's method (also between the AGS and Gauss-Seidel's methods).

In addition, both the AJ and AGS methods require a critical section in order to read all components at the beginning of a cycle and to update the values at the end of a cycle, whereas no critical section is needed with the PA method. However, C.mmp has the drawback that no indivisible instructions exist to read or write floating point numbers (implemented on two consecutive words of memory); therefore, if we are to implement the PA method on C.mmp, only the first 8 bits of the mantissa can be considered significant, and the admissible error in the termination criterion has to be chosen accordingly.

7.2  - Results 

The three methods just described, as well as Jacobi's method, have been implemented on C.mmp to solve the Dirichlet problem for Laplace's equation on a rectangular domain of $\mathbb{R}^2$. Using the method of finite differences, an approximate solution to this problem can be found by solving a linear system of equations. In the experiments reported here, a regular grid has been chosen with $21 \times 24$ interior points, resulting in a linear system of size $n = 504$. This system can be represented in the form $x = F(x) = Ax + b$, where the vector $b$ is obtained from the boundary conditions, and the matrix $A$ is a (very sparse) non-negative matrix with spectral radius $\rho(A) = 0.991$. Since $\rho(|A|) = \rho(A) < 1$, this shows that $A$ is a contracting matrix for the operator $F$, and, therefore, that the result of Theorem 4.1 can be applied to $F$ to ensure the convergence of each iterative method.
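The quoted spectral radius can be cross-checked with a short computation. The sketch assumes a grid of $21 \times 24$ interior points, equal mesh spacing in both directions and the standard five-point stencil, so that $A$ is the Jacobi iteration matrix whose largest eigenvalue is $[\cos(\pi/22) + \cos(\pi/25)]/2$; these modelling details are assumptions, since the text only gives the grid size and the value of $\rho(A)$.

```python
import math


def jacobi_spectral_radius(rows, cols):
    """Largest eigenvalue of the Jacobi matrix of the five-point Laplacian
    on a rows x cols grid of interior points (equal spacing assumed)."""
    return 0.5 * (math.cos(math.pi / (rows + 1)) + math.cos(math.pi / (cols + 1)))


rho = jacobi_spectral_radius(21, 24)
steps_per_decade = -1.0 / math.log10(rho)   # sweeps needed to gain one decimal digit
```

Under these assumptions the radius rounds to 0.991 and about 254 sweeps are needed to divide the error by 10, in line with the sequential Jacobi figures reported for this experiment.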

At the time the measurements were taken, the configuration of C.mmp included six processors, and all iterative methods have been run with a number of processes $k = 1, 2, 3, 4$, and 6. Each of the results reported here is the average of three measurements, but, since C.mmp was used in stand-alone mode during the experiments, very little difference was noted from one run to the next.

In Table 7.1, we report for the four methods the average number of vector evaluations required to reduce (asymptotically) the error vector by a factor of 10: this corresponds to the cost measure $1/E_E$. And, in Table 7.2, we report the average time (expressed in seconds) required to achieve this reduction: this corresponds to the cost measure $1/E_T$.


The bounds obtained from the results of the previous sections are mentioned in parentheses along with the measurements. The parameters in these bounds have been evaluated either directly (e.g., $\rho(A) = 0.991$), or through measurements by tracing the executions of the processes. In particular, for the AJ, AGS and PA methods, the bounds $r$ and $t$, defined in Section 6.2, have been determined by observing the sequencing of the tasks performed by the different processes. Similarly, the penalty factor in Jacobi's method and the overhead due to the critical section in the AJ and AGS methods have been obtained by direct measurements: they are presented in Tables 7.3 and 7.4.


          Jacobi       AJ           AGS          PA

k = 1     254 (254)    254 (254)    127 (254)    127 (254)
k = 2     254 (254)    266 (888)    142 (888)    127 (762)
k = 3     254 (254)    267 (846)    149 (846)    127 (762)
k = 4     254 (254)    273 (825)    166 (825)    129 (762)
k = 6     254 (254)    285 (804)    196 (804)    128 (762)

Table 7.1 - Number of evaluations required to divide the error by a factor of 10


          Jacobi       AJ           AGS          PA

k = 1     337 (337)    337 (337)    168 (337)    168 (337)
k = 2     241 (241)    211 (705)    113 (705)     84 (506)
k = 3     178 (178)    149 (471)     83 (471)     56 (337)
k = 4     153 (153)    123 (372)     75 (372)     43 (253)
k = 6     131 (131)    102 (289)     70 (289)     28 (169)

Table 7.2 - Time (in seconds) required to divide the error by a factor of 10


                      k = 1    k = 2    k = 3    k = 4    k = 6

penalty factor        1        1.43     1.59     1.82     2.34
% of time wasted      0        29.9     37.1     45.1     57.3

Table 7.3 - Penalty factor with Jacobi's method
and percentage of time wasted


                      k = 1    k = 2    k = 3    k = 4    k = 6

overhead factor       1        1.20     1.26     1.35     1.62
% of time wasted      0        16.6     20.8     26.0     38.2

Table 7.4 - Critical section overhead cost with the AJ and AGS methods
and percentage of time wasted

These results must only be considered to illustrate the behavior of asynchronous iterations, since, in particular, the two cost measures reported in Tables 7.1 and 7.2 strongly depend on both the problem (i.e., the matrix $A$) and the multiprocessor system. Yet, they show a clear advantage of asynchronous methods over synchronized methods.

We note, for example, from Table 7.3 that, with Jacobi's method, when $k = 6$ processes are used, the penalty factor is as big as $\lambda_6 = 2.34$. This means that about 57 percent of the time is spent by a process waiting for the other processes to finish their computations. This limits the possible speed-up to 2.6 rather than 6.

We also note that the use of critical sections, too, should be avoided, since, with the AJ or AGS methods, when 6 processes are used, about 38 percent of the time is spent waiting to enter the critical section, again limiting the possible speed-up to 3.7 rather than 6.
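The speed-up limits quoted above follow from the ratio $k$ divided by the measured overhead factor, and the fraction of time wasted from one minus the reciprocal of that factor. The sketch below redoes this arithmetic with the factors for six processes (2.34 for Jacobi's method, 1.62 for the critical section of the AJ and AGS methods); the function names are illustrative.

```python
def possible_speed_up(k, overhead_factor):
    """Speed-up k / lambda achievable when each sweep is slowed by the given factor."""
    return k / overhead_factor


def fraction_wasted(overhead_factor):
    """Share of the time spent waiting: 1 - 1 / lambda."""
    return 1.0 - 1.0 / overhead_factor


jacobi_speed_up = possible_speed_up(6, 2.34)      # penalty factor, synchronized Jacobi
critical_speed_up = possible_speed_up(6, 1.62)    # critical-section overhead, AJ and AGS
jacobi_wasted = fraction_wasted(2.34)
critical_wasted = fraction_wasted(1.62)
```

Rounded to one decimal, the two ratios give 2.6 and 3.7, and the wasted fractions come out at about 57 and 38 percent.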


The measurements for the PA method, on the other hand, indicate that we achieve an almost full speed-up with this method (at least with a small number of processes). An obvious reason for this speed-up is the total absence of any form of synchronization; another reason, specific to the problem we have experimented with and indicated by the results of Table 7.1, is due to the sparsity of the matrix $A$.

The bounds derived in Section 6 have been obtained in a very general case. Yet Tables 7.1 and 7.2 show that they are always within a factor between 3 and 6 of the actual measurements (except for Jacobi's method where they are sharp). In addition, we certainly could obtain much sharper bounds by carrying out the analysis for the specific problem we have experimented with (for example, by taking into account the sparsity of the matrix). In particular, a specific analysis for the PA method can easily explain the fact that $1/E_E$ is almost independent of the number of processes (see Table 7.1).

8 - Asynchronous iterations with super-linear convergence

As we already noticed, the bounds established in Section 6 are certainly not adequate to measure the complexity of iterations with super-linear convergence. In this section, we use as an example the iterative method we have mentioned at the beginning of Section 5 to show how an analysis of the complexity can be done for this case.

To study the convergence of a sequence $x(j)$, $j = 0, 1, \ldots$, toward its limit $\xi$, we now use the following usual measures of complexity. The order of convergence is defined as

$$R = \liminf_{j \to \infty} \, [(-\log\|x(j) - \xi\|)^{1/j}],$$

and, as before, if $c_j$ is the cost associated with the evaluations of the first $j$ iterates, $x(1), \ldots, x(j)$, we define the complexity of the sequence by:

$$E = \liminf_{j \to \infty} \, [(\log(-\log\|x(j) - \xi\|))/c_j].$$

Again, we note that, if the average cost per step $c_j/j$ tends to some finite limit $c$ when $j$ tends to infinity, the complexity is simply given by $E = (\log R)/c$. In the remainder of the section, we assume that the limit $c$ exists.

In order to find the simple root $\xi$ of an operator $G$ from $\mathbb{R}^n$ into itself, we use the Asynchronous Newton's method, AN, as implemented by the two processes described at the beginning of Section 5. Let $r_i$, $i = 1, 2, \ldots$, be the number of iterates evaluated by the first process, $P_1$, during the i-th evaluation of the derivative $G'$ by the second process, $P_2$. Let $j_0 = 0$ and $j_i = r_1 + \cdots + r_i$, for $i = 1, 2, \ldots$; then $x(j_i)$, $i = 0, 1, \ldots$, is the iterate used by $P_2$ for the (i+1)-st evaluation of the derivative. Starting with the two initial values $x(0)$ and $G'(x(0))$, the AN method generates with the two processes $P_1$ and $P_2$ the sequence of iterates $x(j)$, $j = 1, 2, \ldots$, defined by

x(j*l)  “ xXj)  - {G'(x(j^,^))]~^G(x(j}) , for  i - 2,  ...  and  ii<  j i . (8.1)  | 

i 

1 

The  following  theorem  gives  the  mea.surcs  of  complexity  for  this  sequence  if  we  i 

know  come  bounds  on  the  sequence  i « 1,  2, .... 

Theorem 8.1:

Let the initial approximation $x(0)$ be close enough to the root $\xi$, that is,

$$ x(0) \in D_\epsilon = \{ x \mid \|x - \xi\| < \epsilon \} , $$

and let the derivative $G'$ satisfy some Lipschitz condition on $D_\epsilon$:

$$ \|G'(x) - G'(y)\| \le M \|x - y\| , \quad \forall x, y \in D_\epsilon . $$

If $\epsilon$ satisfies the condition

$$ M \|[G'(\xi)]^{-1}\| \epsilon < 2/5 , $$

and if there exist some positive integers $p$ and $q$ such that

$$ p \le r_i \le q , \quad \text{for } i = 1, 2, \ldots , $$

then the order of convergence, $\rho$, and the complexity, $E$, of the sequence defined by
equation (8.1) satisfy:

$$ \rho \ge \lambda_p^{1/q} , \qquad (8.2) $$

and

$$ E \ge (\log \lambda_p)/(q \tau) , \qquad (8.3) $$

where $\lambda_p$ is the largest root of the equation $\lambda^3 - \lambda^2 - (p-1)\lambda - 1 = 0$ (for which we can
check easily that $0.4 + \sqrt{p} < \lambda_p < 0.5 + \sqrt{p}$, $p = 1, 2, \ldots$).

Proof:

The proof is easy but technical, and below we only give an outline for this proof.
Let $\alpha = M \|[G'(\xi)]^{-1}\|$, and let $c = 3\alpha/[2(1 - \alpha\epsilon)]$. From the choice of $\epsilon$, we first note that,
starting with $x(0) \in D_\epsilon$, the sequence $\|x(j) - \xi\|$, $j = 0, 1, \ldots$, is strictly decreasing and
satisfies:

$$ \|x(j_i + 1) - \xi\| \le c \, \|x(j_{i-2}) - \xi\| \, \|x(j_i) - \xi\| , \quad \text{for } i = 2, 3, \ldots , $$

and

$$ \|x(j+1) - \xi\| \le c \, \|x(j_{i-1}) - \xi\| \, \|x(j) - \xi\| , \quad \text{for } i = 2, 3, \ldots \text{ and } j_i < j < j_{i+1} . $$

By substitution, it follows that, for $i = 2, 3, \ldots$,

$$ \|x(j_{i+1}) - \xi\| \le c^{r_{i+1}} \, \|x(j_{i-1}) - \xi\|^{r_{i+1}-1} \, \|x(j_{i-2}) - \xi\| \, \|x(j_i) - \xi\| , $$

and, if we set $u_i = -\log( c \|x(j_i) - \xi\| )$, we obtain:

$$ u_{i+1} \ge u_i + (r_{i+1} - 1) u_{i-1} + u_{i-2} , \quad \text{for } i = 2, 3, \ldots . $$

Therefore, by using the lower bound on $r_i$, we deduce that

$$ u_{i+1} \ge u_i + (p-1) u_{i-1} + u_{i-2} , \quad \text{for } i = 2, 3, \ldots . $$

This shows that $u_i$ tends to infinity at least as fast as $\lambda_p^i$. Therefore, the order of
convergence, $\rho'$, of the subsequence $x(j_i)$, $i = 0, 1, \ldots$, must verify $\rho' \ge \lambda_p$. The bounds
(8.2) and (8.3) are derived directly from this last inequality. ∎
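
The constant $\lambda_p$ of Theorem 8.1 is easy to check numerically. The sketch below is ours and is not part of the original analysis. It assumes that the equation of the theorem reads $\lambda^3 - \lambda^2 - (p-1)\lambda - 1 = 0$, which is the characteristic equation of the recurrence $u_{i+1} = u_i + (p-1)u_{i-1} + u_{i-2}$ used in the proof; the function names `cubic`, `largest_root` and `growth_ratio` are hypothetical. It locates the largest root by bisection and compares it with the growth ratio of the recurrence iterated with equality.

```python
import math

def cubic(lam, p):
    """Characteristic polynomial of u[i+1] = u[i] + (p-1)*u[i-1] + u[i-2]."""
    return lam**3 - lam**2 - (p - 1) * lam - 1.0

def largest_root(p, tol=1e-12):
    """Largest real root of the cubic, by bisection on [1, 2 + sqrt(p)].

    The cubic equals -p at 1 and is positive at 2 + sqrt(p), and it has
    exactly one root larger than 1, so the bracket is valid.
    """
    lo, hi = 1.0, 2.0 + math.sqrt(p)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cubic(mid, p) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def growth_ratio(p, steps=60):
    """Ratio u[i+1]/u[i] after iterating the recurrence with equality."""
    u = [1.0, 1.0, 1.0]
    for _ in range(steps):
        u.append(u[-1] + (p - 1) * u[-2] + u[-3])
    return u[-1] / u[-2]

if __name__ == "__main__":
    for p in (1, 2, 4, 9):
        print(p, largest_root(p), growth_ratio(p))
```

For $p = 4$ the root is $1 + \sqrt{2}$, and for every $p$ the root lies between $0.4 + \sqrt{p}$ and $0.5 + \sqrt{p}$, in agreement with the remark in the theorem.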

In particular, if the cost $c_j$ measures the number of evaluations of the operator $G$,
we simply have $c_j = j$, and, therefore, $E \ge (\log \lambda_p)/q$. On the other hand, if the cost
corresponds to the execution time, the complexity will depend on the implementation
itself. For example, an implementation corresponding strictly to the generation of the
sequence described by equation (8.1) requires the use of a critical section for reading and
writing, in a block, the values of the iterates and of the derivative. The use of a critical
section introduces an overhead, but, as is done with the PA method, the overhead can be
avoided if a process uses whatever values are currently available when needed. In this
case the bounds of Theorem 8.1 still hold, and $\tau$ can be given the value $\tau =$ [...]

The parameters $p$ and $q$, too, depend on the particular implementation of the AN
method, and, especially, on the relative speeds of the processors executing the processes
P1 and P2. In practice, if the processors are equally fast, we expect $r_i$, with small
variations, to be close to $n$, and the values $p = q = n$ can provide good estimates for the
complexity of the AN method implemented with two processes.
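
To make the role of the lagging derivative concrete, here is a scalar sketch of the AN iteration, simulated sequentially. It is only an illustration under assumptions of ours: every derivative evaluation is taken to last exactly `r` iterate evaluations (so that $j_i = i r$ and $p = q = r$), the step from $x(j)$ to $x(j+1)$ with $j_i < j \le j_{i+1}$ uses the derivative computed at $x(j_{i-1})$, and the first steps use $G'(x(0))$. The name `asynchronous_newton` and the test function $G(x) = x^2 - 2$ are hypothetical.

```python
import math

def asynchronous_newton(g, dg, x0, r, steps):
    """Scalar sketch of the AN iteration with a derivative that lags behind.

    Assumption: j_i = i * r, and step j (with j_i < j <= j_{i+1}) uses the
    derivative evaluated at x(j_{i-1}); steps with j <= j_1 use g'(x(0)).
    """
    xs = [x0]
    for j in range(steps):
        i = (j - 1) // r if j >= 1 else 0      # index i with j_i < j <= j_{i+1}
        stale = xs[max(i - 1, 0) * r]          # x(j_{i-1}), or x(0) at the start
        xs.append(xs[-1] - g(xs[-1]) / dg(stale))
    return xs

if __name__ == "__main__":
    iterates = asynchronous_newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5, 3, 30)
    for j, x in enumerate(iterates[:12]):
        print(j, abs(x - math.sqrt(2.0)))
```

Printing the errors shows a slow, chord-like decrease while the first derivative is in use, followed by a much faster decrease once a more recent derivative becomes available.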

The AN method is easily generalizable to more than two processes. If $k$ processes
are available, $k_1$ might be assigned to the evaluation of the sequence of iterates, while
$k_2 = k - k_1$ are assigned to the evaluation of the derivative. The bounds of Theorem 8.1
still hold for this case as well, only with different values for the sequence $r_i$, $i = 1, 2, \ldots$
(or for the bounds $p$ and $q$), determined by the parallel implementations of the two
evaluations. Further results in this direction will be reported elsewhere.

9 - Extensions of the results

We  mention  below  some  direct  extensions  of  the  results  presented  in  this  chapter 
and  some  points  subject  to  further  development. 

A straightforward generalization of the results can be obtained if, instead of $\mathbb{R}^n$, we
consider the product $P$ of $n$ Banach spaces with norms $|\cdot|_i$, $i = 1, \ldots, n$. In this case, if $x$
is an element of $P$, $x$ is determined by its components $x_i$, $i = 1, \ldots, n$, each taken in the
$i$-th space, and $|x|$ represents the non-negative vector of $\mathbb{R}^n$ with components $|x_i|_i$,
$i = 1, \ldots, n$.

Considering only the class of linear operators, $F(x) = Ax + b$, we have noted that the
notion of contracting operators coincides with the condition that $\rho(|A|) < 1$. In [11],
Chazan and Miranker have shown that this condition is not only sufficient but also
necessary for the convergence of all chaotic iterations. This implies, in particular, that all
asynchronous iterations corresponding to a linear operator $F$ are convergent if and only if
$F$ is a contracting operator. The necessity of this condition, however, seems to be
inherent to the linear nature of the problem, and when we also consider non-linear
operators the proof given by Chazan and Miranker does not apply any more. It would be
of interest to obtain conditions on the class of operators for which all asynchronous
iterations are guaranteed to converge. Similar conditions for the convergence of a more
restricted class of iterations would also be of interest, in particular, for the subclass of
asynchronous iterative methods corresponding to the additional assumptions introduced in
Section 6.2.

The bounds we have obtained to estimate the rate of convergence of asynchronous
iterations have been derived by considering the worst possible case, and, compared to
actual measurements, these bounds are very conservative. It would certainly be very
useful to obtain bounds (or estimates) corresponding to the average behavior of
asynchronous iterations, for example, given the probability distributions of the two
sequences $\mathcal{J}$ and $\mathcal{S}$, or, more generally, given the distribution functions for the time it
takes the different processes to evaluate the components.

We have already mentioned the possibility of introducing a relaxation factor in
asynchronous iterations, and, for contracting operators, we have derived a possible range
that guarantees the convergence of all asynchronous iterations. Nothing is known,
however, about the optimal choice of the relaxation factor, for example, given directly the
asynchronous iteration through $\mathcal{J}$ and $\mathcal{S}$, or, again, given the distribution functions for the
evaluation times.

10  - Concluding  remarks 

In the implementation of most parallel algorithms, synchronization seems to be
required to assure the communication between the processes, and to guarantee their
correct executions. However, the main drawback with synchronization is that it degrades
considerably the performance of the algorithms because it is very time consuming. The
class of asynchronous iterative methods avoids this drawback. It includes iterations
corresponding to a parallel implementation in which the cooperating processes have a
minimum of intercommunication and do not make any use of synchronization. The Purely
Asynchronous method described in Section 7.1 is a typical example of an asynchronous
iterative method. Asynchronous iterations follow the same goal as chaotic
relaxations [11]: to eliminate the need for synchronization in a parallel computation.

Asynchronous iterations generalize to asynchronous iterations with memory which
allow different values of the same variable to be used within the same computation. Using
the notions of contracting operators and of m-contracting operators, Theorems 4.1 and 5.1
state sufficient conditions to guarantee the convergence of any asynchronous iterations
and asynchronous iterations with memory. These conditions are satisfied for a large class
of operators.

In the second part of the chapter, asynchronous iterations are evaluated from a
computational point of view, then the results of a series of actual measurements (obtained
by running asynchronous iterations on a multiprocessor) are presented. These results
fully justify the use of asynchronous iterative methods.

General bounds on the complexity of asynchronous iterations are first derived
directly from the proof of the convergence theorem. Although these bounds are sharp for
a parallel implementation of Jacobi's method, they are of little applicability since they
require knowing a priori the exact specification of each step of the iteration. Alternate
bounds are then derived under additional conditions which are usually satisfied in
practical applications. These bounds are consistent with actual measurements: for the
experiments we have run, they are always within a factor of 6 of the measurements. In
addition, it is our feeling that these bounds can be largely improved if we take into
account specific characteristics of the problem being solved, therefore leading to a better
understanding of asynchronous iterations. In Section 8, for example, we have made a first
step in this direction, and we have presented an analysis for the Asynchronous Newton's
method.

A series of experiments has been conducted on C.mmp, a multiprocessor system
(with 6 processors at the time the experiments were run), and several asynchronous
iterative methods have been implemented to solve a large linear system of equations.
They range from Jacobi's method, requiring a full synchronization of all the processes at
each step of the iteration, to the PA method, which requires no synchronization at all. In
between, the AJ and AGS methods are derived from the usual Jacobi's and Gauss-Seidel's
methods, and they require the use of a critical section.

The experimental results show a considerable advantage for the iterative method
with no synchronization at all. For a number of processes up to the number of processors
available on C.mmp, the PA method exhibits full parallelism and has an optimal speed-up
compared to Gauss-Seidel's method, the best sequential method experimented with. The
AJ and AGS methods have a very similar behavior, and when 6 processes are used the
overhead caused by the critical section implies that 30 percent of the time a process is
waiting to enter the critical section. As is intuitively expected, Jacobi's method has
the worst behavior of all the methods considered, and, with 6 processes, the overhead, due
to the synchronization of all the processes at each step of the iteration, is about 57
percent (i.e., more than half the time a process is waiting for the other processes to
finish their computations).

On the basis of these experimental results, and for the problem we have considered,
there does not seem to be any alternative: the PA method is obviously the most efficient
one. In addition, another advantage of the PA method is that it is the easiest one to
implement, and, spacewise, it is also the most efficient one.

Finally, another possibility, which has only been outlined in this chapter, is the
introduction of a relaxation factor. Based only on a few experimental results (not
reported here), it is our belief that we can expect an improvement of the Purely
Asynchronous Over-Relaxation method over the PA method similar to the improvement of
the SOR method over the Gauss-Seidel's method, if we choose the relaxation factor in an
optimal way. The optimal choice of the relaxation factor depends not only on the system
being solved, but also on the probability distributions of the various execution times by
the different processes.


Chapter  IV 


On  the  Alpha-Beta  Pruning  Algorithm 
Part  1:  The  sequential  algorithm 


1 - Introduction

Most so-called intelligent programs use some form of tree searching; among them,
most game playing programs are built around an efficient tree searching algorithm known
as the alpha-beta pruning algorithm. In the first part of this chapter, we investigate the
efficiency of this algorithm with respect to a cost measure first introduced by Knuth and
Moore in [35] and given in Definition 1.1 below. The second part of the chapter is
devoted to the study of a parallel implementation of the algorithm on an asynchronous
multiprocessor.

Definition 1.1:

Let $j$ be the number of terminal positions examined by some algorithm A in
searching a uniform tree of degree $n$ and depth $d$. The quantity

$$ \lim_{d \to \infty} j^{1/d} $$

is called the branching factor corresponding to the search algorithm A. ∎
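
As a small illustration of Definition 1.1, the sketch below evaluates the $d$-th root of the number of terminal positions for two reference cases. Exhaustive search examines $n^d$ terminal positions, so its branching factor is $n$. For comparison we use the classical best-case count $n^{\lceil d/2 \rceil} + n^{\lfloor d/2 \rfloor} - 1$ for alpha-beta on a perfectly ordered uniform tree, a well-known result that is not derived in this chapter; its $d$-th root tends to $\sqrt{n}$. All function names are ours.

```python
import math

def exhaustive_leaves(n, d):
    """Terminal positions examined by a full search of a uniform tree."""
    return n ** d

def best_case_leaves(n, d):
    """Classical minimum for alpha-beta on a perfectly ordered uniform tree."""
    return n ** ((d + 1) // 2) + n ** (d // 2) - 1

def dth_root(j, d):
    """j ** (1/d), computed through logarithms so that huge integers are fine."""
    return math.exp(math.log(j) / d)

if __name__ == "__main__":
    n = 4
    for d in (2, 10, 100, 1000):
        print(d, dth_root(exhaustive_leaves(n, d), d), dth_root(best_case_leaves(n, d), d))
```

With $n = 4$ the first column of estimates stays at 4 while the second approaches 2 as $d$ grows.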

Analyses of the α-β pruning algorithm have been attempted in two recent papers by
Fuller, Gaschnig and Gillogly [23] and by Knuth and Moore [35]. Both papers address the
problem of searching a uniform game tree of degree $n$ and depth $d$ with the α-β pruning
algorithm under the assumptions that the static values assigned to the terminal nodes
are independent identically distributed random variables and that they are all distinct. We
immediately observe that, in order to evaluate the branching factor, the last assumption
requires that the distinct values assigned to the terminal positions be taken from an
infinite range. For most practical applications this is, however, unrealistic.

Fuller, Gaschnig and Gillogly developed in [23] a general formula for the average
number of terminal positions examined by the α-β procedure. Their formula, however, is
computationally intractable and leads to undesirable rounding errors for large trees (i.e.,
for large $n$ and $d$) since it involves, in particular, a $2d-2$ nested summation of terms with
alternating signs and requires on the order of [...] steps for its evaluation. Then they gave
some empirical results based on a series of simulations, and compared the results with
actual measurements obtained by running a modified version of the Technology Chess
Program [24], [25].

In [35], Knuth and Moore have analyzed, under the same conditions, a simpler
version of the full α-β pruning algorithm by not considering the possibility of deep
cut-offs; they have shown, in particular, that the branching factor of the resulting
algorithm is $O(n/\ln n)$. Knuth and Moore also considered other assumptions to account for
dependencies among the static values assigned to the terminal positions and developed
analytic results under those assumptions. Their paper gives, in addition, an excellent
presentation and historical account of the α-β pruning algorithm.

Departing from the assumptions of the two papers we just mentioned, we first
consider the effect of possible equalities between the values assigned to the terminal
nodes of a uniform tree, assuming that those values are independent identically distributed
random variables drawn from any discrete probability distribution. In Section 2, we
establish some notations and preliminary results, and in Section 3, we derive a general
formula for the number of terminal nodes examined by the α-β pruning algorithm when we
take into account both shallow and deep cut-offs. The evaluation of this formula requires
only a finite summation over the range of possible values assigned to the terminal nodes
and is relatively easy. We show, in particular, that, when the terminal nodes can only take
on two distinct values, the branching factor of the α-β pruning algorithm can grow with $n$
as $O(n/\ln n)$ for some choice of the probability distribution. In Section 4, we show that,
when the discrete probability distribution tends to a continuous probability distribution,
the summation derived in Section 3 can be replaced by an integral, which constitutes the
worst case over all discrete probability distributions. In Section 5, an analysis of this
integral shows that the branching factor of the α-β pruning algorithm for a uniform tree of
degree $n$ grows with $n$ as $O(n/\ln n)$, therefore confirming a claim by Knuth and Moore [35]
that deep cut-offs have only a second order effect on the average behavior of the
α-β pruning algorithm. In Section 6, we propose a parallel implementation of the
α-β pruning algorithm in which several processes search for the solution (i.e., the value
associated with the game tree) within different subintervals. This parallel implementation
is analyzed in Section 7; the parallel implementation with 2 processes, in particular, turns
out to be more than twice as efficient as the original α-β pruning algorithm, which is
consequently shown not to be optimal. Some concluding remarks and open problems are
given in the last section.

2 - Presentation and initial properties of the α-β pruning algorithm

There are two usual approaches for dealing with searching a game tree. In [23],
Fuller, Gaschnig and Gillogly adopted the Min-Max approach, while, in [35], Knuth and
Moore chose the Nega-Max approach. We will briefly present, in Section 2.1, the two
approaches and introduce the α-β procedure in terms of the Nega-Max model. Then, in
Section 2.2, we will reestablish an initial result of [23] which was stated in terms of the
Min-Max approach.


2.1 - The α-β procedure

Let us consider a game (like chess, checkers, tic-tac-toe or kalah) played by two
players who take turns. It is common to represent the evolution of the game by means of
a game tree, where each position of the game is represented by a node. If the position is
a dead-end, the node is terminal; otherwise all possible moves from that position are
represented as the successors of the node. The structure of the tree is preserved by not
generating moves leading to some positions already generated (thus, avoiding cycles); this
is the function of the move generator. The evaluation function is another important
function in game playing programs; it assigns to each terminal position a static value by
estimating various parameters such as piece counts, occupation of the board, etc. The
evaluation function evaluates the terminal nodes from one player's viewpoint, giving
higher values to positions more favorable to this player. It is convenient at this point to
name the two players Max and Min. Hence, Max's strategy is to lead the game towards
positions with higher values, while Min's strategy is to lead the game towards positions
with lower values.

The minimax procedure is directly based on this formulation and can be used by
either Max or Min to decide on his next move from a given position, assuming that his
opponent will respond with his best move. Using a rather brute force approach, the
minimax procedure assigns values to all nodes of a game tree. It first assigns to terminal
nodes the results of the evaluation function, then it backs up to each internal node
corresponding to a position from which it is Max's (Min's) turn to play the maximum
(minimum) of the values assigned to its successors.

Suppose it is Max's turn to play from an initial position (corresponding to the root
of the game tree); then it is his turn to play from any position at even depth and Min's
turn to play from any position at odd depth. Therefore, the minimax procedure will
back up values to the nodes of the game tree through a succession of
Minimizing/Maximizing operations. This corresponds to the Min-Max approach.
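
The Min-Max backing-up rule can be sketched in a few lines. The representation is an assumption of ours: a game tree is a nested Python list, and a terminal node is simply the number returned by the evaluation function (from Max's viewpoint). The name `minimax` and the sample tree are hypothetical.

```python
def minimax(tree, maximizing=True):
    """Back up values in the Min-Max model.

    A leaf is a number (the static value); an internal node is a list of
    subtrees. `maximizing` tells whose turn it is at this node.
    """
    if not isinstance(tree, list):
        return tree
    values = [minimax(child, not maximizing) for child in tree]
    return max(values) if maximizing else min(values)

if __name__ == "__main__":
    # Max to play at the root, Min at depth 1, leaves at depth 2.
    print(minimax([[3, 5], [2, 9]]))
```

On the sample tree, Min would answer the first move with 3 and the second with 2, so Max backs up the value 3.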

By observing that:

$$ \max\{ \min\{ x_1, x_2, \ldots \}, \min\{ y_1, y_2, \ldots \}, \ldots \} =
   \max\{ -\max\{ -x_1, -x_2, \ldots \}, -\max\{ -y_1, -y_2, \ldots \}, \ldots \} , $$

the Min-Max approach can be directly reformulated into the Nega-Max approach. In the
Nega-Max formulation, a terminal node of a game tree should be assigned the result of the
evaluation function only if it is at an even depth (assuming it is initially Max's turn to
play) and it should be assigned the opposite of the result of the evaluation function if it is
at an odd depth. The Nega-Max approach requires the same operator at all levels of a
game tree, and the uniformity of the notation will make it easier to carry out an analysis.
This approach will be used throughout.
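
The reformulation can be sketched as follows, again with a nested-list representation that is our own assumption. `negate_odd_leaves` converts Min-Max static values into Nega-Max ones by negating the leaves at odd depth, and `negamax` applies the single operator $v = \max\{-v(\text{son})\}$ at every level. Both names are hypothetical.

```python
def negate_odd_leaves(tree, depth=0):
    """Convert Min-Max leaf values to Nega-Max ones (negate leaves at odd depth)."""
    if not isinstance(tree, list):
        return -tree if depth % 2 else tree
    return [negate_odd_leaves(child, depth + 1) for child in tree]

def negamax(tree):
    """Same operator at all levels: the maximum of the negated values of the sons."""
    if not isinstance(tree, list):
        return tree
    return max(-negamax(child) for child in tree)

if __name__ == "__main__":
    tree = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # Max to play at the root
    print(negamax(negate_odd_leaves(tree)))
```

The value obtained at the root is the same as the one the Min-Max procedure would back up when it is Max's turn to play at the root.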

Figure 2.1 shows the effect of the minimax procedure in a uniform tree of degree 2
and depth 4. The values assigned to the terminal nodes have been chosen arbitrarily. The
path indicated by a darker line shows the sequence of moves selected by the procedure.


Figure 2.1 - Searching a game tree with the minimax procedure


The minimax procedure is clearly a brute force search and, when exploring a node,
it uses none of the information already available from the nodes previously explored.
Obviously, by taking advantage of the information previously acquired we can easily
improve on the brute force search. Figure 2.2 presents some simple patterns in which the
distribution of the information could lead to such improvements. In the figure, the circled
nodes have already been explored, and they are labeled with their backed-up values; the
values of the other nodes are yet to be determined. We are interested in the value $v$ of
the top level node in both patterns (a) and (b).


(a) shallow cut-off        (b) deep cut-off


Figure 2.2 - Examples of possible cut-offs

Let us consider the pattern of Figure 2.2 (a) first. From the definition of the
minimax procedure, the values $v$ and $x$ satisfy:

$$ v = \max\{ 3, -x \} , \qquad x = \max\{ -2, \ldots \} , $$

which shows that $x \ge -2$ or $2 \ge -x$. Since $3 \ge 2 \ge -x$, it follows that, independent of the
exact value of $x$, we will have $v = 3$. This shows that we need not explore further the
successors of the node labeled $x$ if we are only interested in the value of $v$. This leads to
a first type of cut-offs known as shallow cut-offs.

The pattern of Figure 2.2 (b) illustrates a deeper cut-off. As with the previous
example, there are immediate relations between the values of the nodes. In particular, we
have $y \ge -x$, which leads us to consider two cases. Either $y > -x$, and this means that the
value $y$ is determined by its right son(s) and certainly does not depend on the right son(s)
of $x$. Or $y = -x$, in which case, writing $w$ for the son of $v$ that is the father of $y$, since
$w \ge -y$ and $x \ge -2$, we deduce $w \ge -2$ or $-w \le 2$; but since $v = \max\{3, -w\}$ it follows that
$v = 3$, independent of the exact value of $w$ and, a fortiori, independent of the exact value
of $x$. This shows that in either case the successors of the node labeled $x$ need not be
further explored since the final value of $v$ would in no way be affected.


The two examples presented in Figure 2.2 indicate that a reduction of the search
can be achieved if a node passes down to its sons the current value backed-up so far (3 in
the case of the two above examples) as a bound for pruning branches 2, 4, ... levels
below; the bound can, of course, be improved as the search progresses down the tree
(leading to more and more possible cut-offs).

Using two bounds for even and odd levels of a tree, these improvements are
implemented in the following procedure adapted from [35].

The  Alpha-Beta  procedure  (from  [35]) 


integer procedure ALPHABETA(position P, integer alpha, integer beta);
begin integer j, t, n;
    determine the successor positions: P_1, ..., P_n;
    if n = 0 then
        ALPHABETA := f(P)
    else
    begin
        for j := 1 step 1 until n do
        begin
            t := -ALPHABETA(P_j, -beta, -alpha);
            if t > alpha then alpha := t;
            if alpha ≥ beta then goto done                        (2.1)
        end;
        done: ALPHABETA := alpha
    end
end


The function denoted by $f$ is the evaluation function which assigns static values to terminal
positions.

Knuth and Moore [35] have shown this procedure to be correct in the sense that the
call ALPHABETA(P,-∞,+∞) assigns to position P the value MINIMAX(P), assigned by the
minimax procedure. More generally, they showed [35, p. 297] that, if alpha < beta:

    ALPHABETA(P,alpha,beta) ≤ alpha,         if  MINIMAX(P) ≤ alpha,           (2.2)

    ALPHABETA(P,alpha,beta) = MINIMAX(P),    if  alpha < MINIMAX(P) < beta,    (2.3)

    ALPHABETA(P,alpha,beta) ≥ beta,          if  MINIMAX(P) ≥ beta.            (2.4)
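
The procedure and relations (2.2) to (2.4) can be exercised on a small example. The sketch below is our transcription into Python, assuming that a position is a nested list whose leaves already carry their Nega-Max static values $f(P)$; `negamax` plays the role of MINIMAX, and the sample tree is hypothetical.

```python
import math

def negamax(tree):
    """MINIMAX in Nega-Max form: leaves are numbers, internal nodes are lists."""
    if not isinstance(tree, list):
        return tree
    return max(-negamax(child) for child in tree)

def alphabeta(tree, alpha, beta):
    """Transcription of procedure ALPHABETA for nested-list trees."""
    if not isinstance(tree, list):          # n = 0: terminal position
        return tree
    for child in tree:
        t = -alphabeta(child, -beta, -alpha)
        if t > alpha:
            alpha = t
        if alpha >= beta:                   # cut-off test, line (2.1)
            break
    return alpha

if __name__ == "__main__":
    tree = [[3, 5], [2, 9]]
    print(negamax(tree), alphabeta(tree, -math.inf, math.inf))
    for window in ((4, 10), (0, 10), (-10, 2)):
        print(window, alphabeta(tree, *window))
```

With the full window the two procedures agree. With a narrower window the result is exact only when the minimax value falls strictly inside it; otherwise it is only a bound on the correct side, which is the indetermination expressed by (2.2) and (2.4).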

The game tree used in Figure 2.1 to illustrate the minimax procedure is shown in
Figure 2.3 as evaluated by the α-β procedure. The branches pruned by the
procedure are indicated with dashed lines, and the nodes marked with a circle have not
been completely explored.


Figure 2.3 - Searching a game tree with the α-β procedure


We observe that only 8 out of the 16 terminal positions and 19 out of all the 31 nodes are
examined by the α-β pruning algorithm in this example, reducing greatly the cost of
searching the tree. As is seen by comparing Figures 2.1 and 2.3, the values backed-up by
the α-β procedure to some internal nodes are not necessarily the same as the values
backed-up by the minimax procedure, as reflected by the indetermination in
equations (2.2) and (2.4). The top value, however, is not affected by this indetermination.
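
The saving can be measured by counting the terminal positions actually examined. The sketch below does this on uniform trees of degree 2 and depth 4 with randomly chosen leaf values; it does not reproduce the values of Figures 2.1 and 2.3, which are not available here. The nested-list representation and all names are assumptions of ours.

```python
import math
import random

def negamax(tree):
    """Exhaustive Nega-Max evaluation; examines every terminal position."""
    if not isinstance(tree, list):
        return tree
    return max(-negamax(child) for child in tree)

def alphabeta_count(tree, alpha, beta, counter):
    """Alpha-beta search that counts examined terminal positions in counter[0]."""
    if not isinstance(tree, list):
        counter[0] += 1
        return tree
    for child in tree:
        t = -alphabeta_count(child, -beta, -alpha, counter)
        if t > alpha:
            alpha = t
        if alpha >= beta:
            break
    return alpha

def uniform_tree(n, d, rng):
    """Uniform tree of degree n and depth d with small random integer leaves."""
    if d == 0:
        return rng.randint(-9, 9)
    return [uniform_tree(n, d - 1, rng) for _ in range(n)]

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        tree = uniform_tree(2, 4, rng)
        counter = [0]
        value = alphabeta_count(tree, -math.inf, math.inf, counter)
        print(value, negamax(tree), counter[0], "of 16 terminal positions")
```

Whatever the leaf values, the top value agrees with the exhaustive search, and the count stays between 7 (the classical minimum for this tree shape) and 16.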


2.2 - Some properties of the α-β pruning algorithm


In this section, we will introduce some notations which will be used throughout, and
we will reestablish, in terms of the Nega-Max approach, an initial result of [23] giving a
necessary and sufficient condition for any node of a game tree to be examined by the
α-β pruning algorithm.

2.2.1 - Notations


As in [35], we will use the Dewey decimal notation to represent a node in a tree.
More precisely, let $\epsilon$, the empty sequence, denote the root of the game tree. Then, if $J$
denotes some internal node of the tree with $n$ sons, $J.j$ will denote the $j$-th son of node $J$,
for $j = 1, \ldots, n$. In Figure 2.4, node 4.1.3.4.3 is the node at depth 5 whose path from the
root is indicated with a darker line.


c(4) = 2
c(4.1) = -∞
c(4.1.3) = 3
c(4.1.3.4) = -5
c(4.1.3.4.3) = 0

α(4.1.3.4.3) = max{ c(4.1.3.4.3), c(4.1.3), c(4) } = 3
β(4.1.3.4.3) = -max{ c(4.1.3.4), c(4.1) } = 5


Figure 2.4 - Portion of a game tree showing the path to node 4.1.3.4.3

The value associated with some node $J$ of a game tree by the minimax procedure
(see Section 2.1) will be denoted by $v(J)$. Then, if $J$ is a terminal node, $v(J)$ is the static
value assigned to that terminal position, and, if $J$ is an internal node, $v(J)$ is the value
backed-up to node $J$ by the minimax procedure. In the latter case, if node $J$ has $n$ sons,
$v(J)$ is given by:

$$ v(J) = \max\{ -v(J.j) \mid 1 \le j \le n \} . \qquad (2.5) $$

In Figure 2.4, the nodes on the path from the root to node 4.1.3.4.3 are evaluated through
formula (2.5) while the other nodes (including 4.1.3.4.3) are shown as terminal nodes and
are assigned arbitrary values. (Nodes are labeled with their values.)


While the values $v(J)$ deal with the static aspect of a game tree, the quantities we
will introduce next deal more with the dynamic aspect of the tree when being searched by
the α-β procedure.

For any node $J.j$ at depth $d \ge 1$, we define:

$$ c(J.j) = \max\{ -v(J.i) \mid 1 \le i \le j-1 \} . $$

(By convention, the maximum over an empty set is defined to be $-\infty$; in particular,
$c(J.1) = -\infty$.) For the root of the tree we also define $c(\epsilon) = -\infty$. The quantity $c(J)$ accounts
for the information provided to node $J$ by its elder brothers. These values are indicated
to the right of the game tree shown in Figure 2.4 for all nodes on the path to node
4.1.3.4.3; only the nodes indicated with squares are used in computing these values.

We finally define for any node $J = j_1. \ldots .j_d$ at depth $d \ge 1$ in a game tree two
quantities directly associated with node $J$ by the α-β procedure. For $i = 0, \ldots, d-1$, let
$J_i = j_1. \ldots .j_{d-i}$. We define:

$$ \alpha(J) = \max\{ c(J_i) \mid i \text{ is even}, \; 0 \le i \le d-1 \} , $$

$$ \beta(J) = -\max\{ c(J_i) \mid i \text{ is odd}, \; 0 \le i \le d-1 \} . $$

It is convenient to define these two quantities for the root of the game tree by $\alpha(\epsilon) = -\infty$
and $\beta(\epsilon) = +\infty$ (which is consistent with the definition). These α- and β-values are shown
in Figure 2.4 for the node 4.1.3.4.3 along with their definitions.
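
As a worked check of these definitions, the sketch below recomputes the α- and β-values of node 4.1.3.4.3 from the $c$-values listed with Figure 2.4. The function name `alpha_beta_bounds` and the list encoding of the path are ours.

```python
import math

def alpha_beta_bounds(c_along_path):
    """Return (alpha(J), beta(J)) from the c-values on the path to J.

    `c_along_path` lists c(.) from the node at depth 1 down to J itself.
    """
    from_node_up = c_along_path[::-1]              # c(J_0), c(J_1), ..., c(J_{d-1})
    alpha = max(from_node_up[0::2])                # i even
    beta = -max(from_node_up[1::2], default=-math.inf)   # i odd; empty set gives +infinity
    return alpha, beta

if __name__ == "__main__":
    # c(4), c(4.1), c(4.1.3), c(4.1.3.4), c(4.1.3.4.3) as given with Figure 2.4
    print(alpha_beta_bounds([2, -math.inf, 3, -5, 0]))
```

The result is the pair (3, 5), in agreement with the values given with the figure; for a node at depth 1 the β-value is $+\infty$, since no ancestor contributes an odd index.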

2.2.2  *■  Necessary  and  sufficient  condition  lor  a node  to  be  explored  by  the  ec-^  procedure 

The following lemma justifies the notations we just introduced in the preceding
section.

Lemma  2.1: 

Assume that, initially, the root of a game tree is explored by the α-β procedure
through the call

ALPHABETA(root, $-\infty$, $+\infty$) .  (2.6)

Then, if node $J$ is examined, it is through a call of procedure ALPHABETA in which the
parameters alpha and beta satisfy:

alpha $= \alpha(J)$ ,  (2.7)

beta $= \beta(J)$ .  (2.8)

Proof: 

If $J = j_{d-1} \dots j_0$ denotes some node explored by the procedure at depth $d \ge 1$, let, as
before, $J_i = j_{d-1} \dots j_i$ for $0 \le i \le d-1$. Thus node $J_1$ is the father of node $J$ while, if
$j_0 \ge 2$, node $J'$ is the brother of $J$ immediately preceding $J$ (and explored just
before $J$). Observe that, if $j_0 = 1$, $c(J_0) = c(J) = -\infty$ and therefore:

$\alpha(J) = \max\{ c(J_i) \mid i \text{ is even},\ 0 \le i \le d-1 \}$

$= -[\, -\max\{ c(J_{i+1}) \mid i \text{ is odd},\ 0 \le i \le d-2 \} \,]$

$= -\beta(J_1)$

(similarly, $\beta(J) = -\alpha(J_1)$). Observe also that, if $j_0 \ge 2$:

$\alpha(J) = \max\{ \alpha(J_2),\ c(J) \}$

$= \max\{ \alpha(J'),\ -v(J') \}$

and that $\beta(J) = \beta(J')$.

By the call of line (2.6), relations (2.7) and (2.8) certainly hold for the root of the
game tree, since $\alpha(\varepsilon) = -\infty$ and $\beta(\varepsilon) = +\infty$. Then the proof follows by induction from
inspection of the procedure ALPHABETA, and from the relations we derived above. ■

The following theorem states a useful relation that characterizes the fact that a
node of a tree is explored by the α-β pruning algorithm. This relation was first
established by Fuller, Gaschnig and Gillogly [23] with different notations in terms of the
Min-Max model.

Theorem  2.1: 

Assume that, initially, the root of a game tree is explored by the α-β procedure
through the call

ALPHABETA(root, $-\infty$, $+\infty$) .

Then, an arbitrary node $J$ of the game tree is subsequently explored if and only if

$\alpha(J) < \beta(J)$ .  (2.9)

Proof:

Because of the presence of line (2.1) in the procedure ALPHABETA, the result
follows directly from the result of Lemma 2.1. ■

Since it will be more convenient in the following sections, rather than $\alpha(J)$ and
$\beta(J)$ we will use the quantities:

$A(J) = \max\{ c(J_i) \mid i \text{ is even},\ 0 \le i \le d-1 \}$ ,

$B(J) = \max\{ c(J_i) \mid i \text{ is odd},\ 0 \le i \le d-1 \}$ ,

where $J_i$ is defined as before. The definitions of $A(J)$ and $B(J)$ are more symmetrical, and
relation (2.9) can also be rewritten in a more symmetrical way:

$A(J) + B(J) < 0$ .  (2.10)
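Relation (2.10) can be checked experimentally on small trees. The sketch below is not the ALPHABETA procedure of the text (which is not reproduced in this section); it assumes the usual negamax formulation of α-β with a cut-off as soon as the running value reaches beta. Tree layout (a dictionary from paths to bottom values), function names and the value range are all assumptions made for the example.

```python
import itertools
import random

NEG_INF = float("-inf")


def negamax(leaves, n, d, path=()):
    """True backed-up value v(J) of the node reached by `path` (formula (2.5))."""
    if len(path) == d:
        return leaves[path]
    return max(-negamax(leaves, n, d, path + (j,)) for j in range(1, n + 1))


def alphabeta(leaves, n, d, path, alpha, beta, seen):
    """Negamax alpha-beta; records in `seen` every terminal node it examines."""
    if len(path) == d:
        seen.add(path)
        return leaves[path]
    best = alpha
    for j in range(1, n + 1):
        best = max(best, -alphabeta(leaves, n, d, path + (j,), -beta, -best, seen))
        if best >= beta:  # cut-off: the remaining sons are not explored
            break
    return best


def c(leaves, n, d, path):
    """c(J.j) = max{-v(J.i) | i < j}, with -inf for a first son."""
    parent, j = path[:-1], path[-1]
    return max((-negamax(leaves, n, d, parent + (i,)) for i in range(1, j)),
               default=NEG_INF)


def satisfies_2_10(leaves, n, d, path):
    """Evaluate A(J) + B(J) < 0 for the terminal node J reached by `path`."""
    # ancestors[i] plays the role of J_i (J_0 = J; the root is left out since c = -inf)
    ancestors = [path[:len(path) - i] for i in range(len(path))]
    cs = [c(leaves, n, d, a) for a in ancestors]
    A = max(cs[0::2], default=NEG_INF)
    B = max(cs[1::2], default=NEG_INF)
    return A + B < 0


def compare(n, d, seed, values=(-2, -1, 0, 1, 2)):
    """Return (terminal nodes examined by the search, nodes predicted by (2.10))."""
    rng = random.Random(seed)
    paths = list(itertools.product(range(1, n + 1), repeat=d))
    leaves = {p: rng.choice(values) for p in paths}
    seen = set()
    alphabeta(leaves, n, d, (), NEG_INF, float("inf"), seen)
    predicted = {p for p in paths if satisfies_2_10(leaves, n, d, p)}
    return seen, predicted
```

For instance, `compare(3, 3, seed=0)` returns two sets that should coincide; the small value range is chosen on purpose so that ties between bottom values occur.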

3 - Number of nodes explored by the α-β procedure: discrete case

As in [23] and [35], we will evaluate in this and the following section the amount of
work performed in searching a random uniform game tree using the α-β pruning algorithm.
The definition and some properties of random uniform game trees are given in Section 3.1.
The amount of work performed by the α-β procedure is measured by the number of
terminal nodes examined during the search and is evaluated in Section 3.2.

3.1 - Random uniform game trees

In order to perform an analysis of the α-β pruning algorithm, we will limit
ourselves and consider the following class of game trees.

Definition 3.1:

A game  tree  in  which 

(a) all internal nodes have exactly n sons, and

(b)  all  terminal  nodes  (or  bottom  positions)  are  at  depth  d 
is  called  a uniform  game  tree  of  degree  n and  depth  d. 


A uniform game tree which satisfies the additional condition

(c) the values assigned to all terminal nodes (or bottom values) are independent
identically distributed random variables

is called a random uniform game tree, or, for short, a rug tree. ■

Unless otherwise specified, we will only consider throughout a rug tree of degree n
and depth d.

Since the value backed-up to a node by the minimax procedure only depends on the
backed-up values of its sons, we immediately observe that, by condition (c), the backed-up
values of all nodes at the same depth are also independent identically distributed random
variables. In the remainder of the section, we will assume that the bottom values are
drawn from the finite set $\{ k/m \mid -m \le k \le m \}$, for some $m > 0$, and we will denote by
$\{p_i(k)\}_{-m \le k \le m}$, or simply $\{p_i(k)\}$, the common probability distribution for the backed-up
values of all nodes at depth $d - i$ (i.e., $p_i(k)$ is the probability that the value
backed-up by the minimax procedure to some node $J$ at depth $d-i$ be $k/m$). In particular,
$\{p_0(k)\}$ is the common probability distribution for all bottom values, and $\{p_d(k)\}$ is the
probability distribution for the value backed-up to the root of the rug tree.

The  following  lemma  states  the  relations  between  these  probability  distributions. 

Lemma 3.1:

For $i = 0, \dots, d-1$, we have:

$p_{i+1}(-m) + \dots + p_{i+1}(k) = [\, p_i(-k) + \dots + p_i(m) \,]^n$ .  (3.1)

Proof:

Let $J$ be some internal node at depth $d-i-1$, then by equation (2.5), $v(J) \le k$ if and
only if $-v(J.j) \le k$, for $j = 1, \dots, n$. Equation (3.1) follows easily from the fact that all
variables $v(J.j)$ are independent. ■

In what follows, $\mathcal{P}_i(k)$ denotes the sum $p_i(-k) + \dots + p_i(m)$ appearing on the
right-hand side of equation (3.1). For convenience, we also define $\mathcal{P}_i(-m-1) = 0$. Note that
$\mathcal{P}_i(k)$ is a non-decreasing function of $k$ which satisfies $\mathcal{P}_i(-m-1) = 0$ and
$\mathcal{P}_i(m) = p_i(-m) + \dots + p_i(m) = 1$. By rewriting equation (3.1), we see that $\mathcal{P}_i$ satisfies:

$\mathcal{P}_{i+1}(-k-1) = 1 - [\mathcal{P}_i(k)]^n$  for $i = 0, 1, \dots$ ,  (3.2)

and, therefore:

$\mathcal{P}_{i+2}(k) = 1 - \{ 1 - [\mathcal{P}_i(k)]^n \}^n$  for $i = 0, 1, \dots$ .  (3.3)
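The recurrences (3.2) and (3.3) are easy to iterate exactly. The sketch below assumes that the quantity they govern is the tail sum $p_i(-k) + \dots + p_i(m)$, which is how the right-hand side of (3.1) reads; the dictionary layout, the function names and the sample distribution are hypothetical.

```python
from fractions import Fraction


def next_tail_sums(P, n, m):
    """Apply (3.2): P_{i+1}(-k-1) = 1 - [P_i(k)]^n for k = -m-1, ..., m."""
    return {-k - 1: 1 - P[k] ** n for k in range(-m - 1, m + 1)}


def tail_sums(p0, n, m, depth):
    """Return [P_0, ..., P_depth]; each P_i maps k in -m-1..m to P_i(k).

    Assumption: P_0(k) = p0(-k) + ... + p0(m), with P_0(-m-1) = 0 (empty sum).
    """
    first = {k: sum((p0.get(j, 0) for j in range(-k, m + 1)), Fraction(0))
             for k in range(-m - 1, m + 1)}
    levels = [first]
    for _ in range(depth):
        levels.append(next_tail_sums(levels[-1], n, m))
    return levels


# Hypothetical bottom distribution on {-1, 0, 1} (m = 1), for a tree of degree 2.
p0 = {-1: Fraction(1, 4), 0: Fraction(1, 4), 1: Fraction(1, 2)}
levels = tail_sums(p0, n=2, m=1, depth=2)
```

With this reading, `levels[1][-1]` is the probability that a node just above the bottom level backs up the value 1, which happens exactly when at least one of its two sons holds -1.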

The following quantities will also be useful in Section 3.2. For $i = 0, 1, \dots$ and
$-m-1 \le k \le m$, define:

$\rho_i(k) = 1 + [\mathcal{P}_i(k)] + \dots + [\mathcal{P}_i(k)]^{n-1}$ ,  (3.4)

and

$\sigma_i(k) = 1 + [\mathcal{P}_i(-k-1)] + \dots + [\mathcal{P}_i(-k-1)]^{n-1}$ .  (3.5)

Observe that $\rho_i(-m-1) = \sigma_i(m) = 1$ and $\rho_i(m) = \sigma_i(-m-1) = n$.


Lemma 3.1 establishes the probability distributions for all the values in the nodes
of a rug tree. The next lemma establishes a similar result for the quantities $c(J)$ defined
in Section 2.

Lemma 3.2:

Let $J.j$ denote any node at depth $i$, where $i = 1, \dots, d$. If $j = 1$, $c(J.j) = -\infty$. If
$j \ge 2$, then the probability distribution of $c(J.j)$, denoted by $\{q_k(J.j)\}_{-m \le k \le m}$, satisfies:

$q_{-m}(J.j) + \dots + q_k(J.j) = [\mathcal{P}_{d-i}(k)]^{j-1}$ .  (3.6)

Proof:

When $j = 1$, $c(J.j) = -\infty$ by definition. When $j \ge 2$, equation (3.6) follows from the
same argument given in the proof of Lemma 3.1. ■

In order to evaluate, through equation (2.10), the probability that a terminal node is
explored, we first need to determine the probability distributions for the two quantities
$A(J)$ and $B(J)$. This is done in the following.

Lemma 3.3:

Let $J = j_{d-1} \dots j_0$ be a terminal node.

(1) If $j_i = 1$ for all even integers $i$ in the range $0 \le i \le d-1$, then $A(J) = -\infty$.

(2) Otherwise, the probability distribution for $A(J)$, denoted by $\{a_k(J)\}_{-m \le k \le m}$,
satisfies:

$a_{-m}(J) + \dots + a_k(J) = \prod_e [\mathcal{P}_i(k)]^{j_i - 1}$ ,  (3.7)

where the product denoted by $\prod_e$ is extended to all even integers in the range
$0 \le i \le d-1$.

Similarly,

(1') If $j_i = 1$ for all odd integers $i$ in the range $1 \le i \le d-1$, then $B(J) = -\infty$.

(2') Otherwise, the probability distribution for $B(J)$, denoted by $\{b_k(J)\}_{-m \le k \le m}$,
satisfies:

$b_{-m}(J) + \dots + b_k(J) = \prod_o [\mathcal{P}_i(k)]^{j_i - 1}$ ,  (3.8)

where the product denoted by $\prod_o$ is extended to all odd integers in the range
$1 \le i \le d-1$.

Proof:

We will only consider $A(J)$ since the proof relative to $B(J)$ is the same. Part (1)
follows directly from the definition. For part (2), let $J_i$ denote the node $j_{d-1} \dots j_i$. We
note that $A(J) \le k$ if and only if $c(J_i) \le k$ for all even integers $i$ in the range $0 \le i \le d-1$
such that $j_i \ge 2$. Since the variables $c(J_i)$ are independent, equation (3.7) follows from
equation (3.6) by observing that, in the product $\prod_e$, a factor corresponding to $j_i = 1$
amounts to 1. ■

The last lemma in this section states the probability of exploring a terminal node.

Lemma 3.4:

The probability, $\pi(J)$, that a terminal node $J = j_{d-1} \dots j_0$ is explored by the α-β
procedure is given by:

$\pi(J) = 1$  if $j_i = 1$ for all even integers $i$ in the range $0 \le i \le d-1$,

$\pi(J) = 1$  if $j_i = 1$ for all odd integers $i$ in the range $1 \le i \le d-1$,

$\pi(J) = \sum_{-m \le k \le m-1} a_k(J) \, [\, b_{-m}(J) + \dots + b_{-k-1}(J) \,]$  otherwise.  (3.9)

Proof:

When $j_i = 1$ for all even integers $i$ in the range $0 \le i \le d-1$, by Lemma 3.3 $A(J) = -\infty$.
Hence $A(J) + B(J) = -\infty$ too, and by Theorem 2.1 node $J$ is certainly explored. Similarly
when $j_i = 1$ for all odd integers in the range $1 \le i \le d-1$.

Otherwise, both $A(J)$ and $B(J)$ are finite. Let $A(J) = k/m$. We observe that
$A(J) + B(J) < 0$ if and only if $-m \le k \le m-1$ and $-1 \le B(J) \le (-k-1)/m$. Hence, equation (3.9)
follows from Theorem 2.1 and the fact that $A(J)$ and $B(J)$ are independent variables. ■

Using equations (3.7) and (3.8), equation (3.9) can be rewritten as:

$\pi(J) = \sum_{-m \le k \le m-1} \left[ \prod_e [\mathcal{P}_i(k)]^{j_i-1} - \prod_e [\mathcal{P}_i(k-1)]^{j_i-1} \right] \prod_o [\mathcal{P}_i(-k-1)]^{j_i-1}$  (3.10)

(recall that $\mathcal{P}_i(-m-1) = 0$).

3.2 - Number of terminal nodes examined by the α-β pruning algorithm: discrete case

We are now able to evaluate the amount of work performed by the α-β procedure
while searching a rug tree. As in [23] and [35], we have chosen to measure the amount of
work by the number of terminal nodes examined by the procedure. (We will also consider
briefly, at the end of the section, the total number of internal and terminal nodes explored
by the procedure as a measure of performance.)

Theorem 3.1:

The average number, $N_{n,d}(m)$, of bottom positions examined by the
α-β procedure in searching a rug tree of degree n and depth d, for which the bottom
values are distributed according to the discrete probability distribution $\{p_0(k)\}_{-m \le k \le m}$, is
given by:

$N_{n,d}(m) = n^{\lfloor d/2 \rfloor} + \sum_{-m \le k \le m} \left[ \prod_e \rho_i(k) - \prod_e \rho_i(k-1) \right] \prod_o \sigma_i(k)$ ,  (3.11)

where the quantities $\rho_i(k)$ and $\sigma_i(k)$ are defined by equations (3.4) and (3.5), and
where the products denoted by $\prod_e$ and $\prod_o$ are defined in Lemma 3.3.

Proof:

By definition of the probability $\pi(J)$, the average number of bottom positions
examined by the α-β procedure is

$N_{n,d}(m) = \sum \pi(J)$ ,

where the sum is extended to all terminal nodes $J = j_{d-1} \dots j_0$. It is actually a
d-nested summation over the range $1 \le j_0 \le n$, $1 \le j_1 \le n$, ..., $1 \le j_{d-1} \le n$. The summation
can be rearranged as:

$N_{n,d}(m) = \sum_e \pi(J) + \sum_o \pi(J) + \sum' \pi(J) - \pi(1.\dots.1)$ ,

where the three summations $\sum_e$, $\sum_o$ and $\sum'$ correspond to the three expressions for $\pi(J)$
given in Lemma 3.4. The fourth term $\pi(1.\dots.1)$ is subtracted from the sum since it is
counted by both $\sum_e$ and $\sum_o$. These two sums are easily evaluated since all the terms $\pi(J)$
are 1. As $\pi(1.\dots.1)$ itself is 1, we obtain:

$N_{n,d}(m) = n^{\lceil d/2 \rceil} + n^{\lfloor d/2 \rfloor} - 1 + \sum' \pi(J)$ .  (3.12)

It is to be noted that the first three terms correspond exactly to the number of terminal
nodes examined by the α-β procedure under optimal ordering of the bottom values
(see [56, p. 201]).

We now evaluate the sum $\sum'$. Inside the sum the terms $\pi(J)$ can be evaluated
through equation (3.10). We note that all the summations relative to $j_i$, for $i = 0, 1, \dots, d-1$,
can be done independently, each one being the sum of a geometric series. Using the
quantities $\rho_i(k)$ and $\sigma_i(k)$ defined by equations (3.4) and (3.5), we obtain:

$\sum' \pi(J) = \sum_{-m \le k \le m-1} \left[ \prod_e \rho_i(k) - \prod_e \rho_i(k-1) \right] \prod_o \sigma_i(k) - \prod_e \rho_i(m-1) + 1$ .

The theorem follows from this last equation and equation (3.12), using the facts that
$\rho_i(m) = n$ and that $\sigma_i(m) = 1$. ■
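The series of equation (3.11) can be evaluated mechanically. The sketch below is one possible transcription, not code from the thesis; it assumes the reading of (3.11) in which the series is preceded by the term $n^{\lfloor d/2 \rfloor}$ (the reading consistent with equation (3.12)), and it takes the quantity entering (3.4) and (3.5) to be the tail sum $p_i(-k) + \dots + p_i(m)$. Function and variable names are hypothetical.

```python
from fractions import Fraction


def expected_leaves(p0, n, m, d):
    """Average number of bottom positions examined, following (3.11).

    p0 maps k in -m..m to the probability of the bottom value k/m.
    """
    # P[i][k] = p_i(-k) + ... + p_i(m) for k = -m-1..m, built with recurrence (3.2).
    P = [{k: sum((p0.get(j, 0) for j in range(-k, m + 1)), Fraction(0))
          for k in range(-m - 1, m + 1)}]
    for _ in range(d):
        prev = P[-1]
        P.append({-k - 1: 1 - prev[k] ** n for k in range(-m - 1, m + 1)})

    def rho(i, k):    # equation (3.4)
        return sum(P[i][k] ** e for e in range(n))

    def sigma(i, k):  # equation (3.5)
        return sum(P[i][-k - 1] ** e for e in range(n))

    def prod_even(k):
        out = Fraction(1)
        for i in range(0, d, 2):
            out *= rho(i, k)
        return out

    def prod_odd(k):
        out = Fraction(1)
        for i in range(1, d, 2):
            out *= sigma(i, k)
        return out

    total = Fraction(n ** (d // 2))
    for k in range(-m, m + 1):
        total += (prod_even(k) - prod_even(k - 1)) * prod_odd(k)
    return total
```

Two quick sanity checks: a tree of depth 1 must have all of its n bottom positions examined, and a distribution concentrated on a single value must give the optimal-ordering count $n^{\lceil d/2 \rceil} + n^{\lfloor d/2 \rfloor} - 1$.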

The formula of equation (3.11) can be easily evaluated and provides us with a
measure of performance for the α-β pruning algorithm. For some applications, however
(especially when the cost of generating moves is greater than the cost of evaluating
positions), it is more convenient to use the total number of nodes (internal and terminal)
explored by the α-β procedure as a measure of performance. Let denote the average

of  this  number.  The  same  way  we  evaluated  we  can  evaluate  j(m)  by  summing 

the  probabilities  n<J)  over  all  nodes  of  the  tree.  Wo  obtain: 

- AfO /m)  ♦ N'/rrO  ♦ ...  ♦ N^^^(ni)  , 

whore  is  the  average  number  of  nodes  examined  at  depth  i,  ancf  is  directly 

derived  from  the  expression  of  N^  Jm)  in  equation  (3.11)  by  replacing  d by  i and  [pQ(k)} 
by  (recall  that  {pQ(k)]  is  the  probability  distribution  for  the  values  assigned  to 

the  terminal  nodes  and  that  is  Ihc  probability  distribution  for  the  values 

backed-up  to  rrodcs  at  depth  t). 


3.3 - Bi-valued rug trees

Although it is relatively easy in most game playing programs to obtain (by
inspection of the evaluation function) an accurate bound for the range of distinct values
assigned to the various positions of the game, it is usually not so easy to derive a good
estimate for the probability distribution of these values. In the remainder of the section
we will study rug trees in which the terminal nodes can only take on two distinct values,
and we will see, in particular, that a change in the probability distribution of these values
can lead to very important differences in the growth rate of $N_{n,d}(m)$.

We will assume in the following that the values assigned to the terminal nodes of a
rug tree can only be either -1 or +1 with respective probabilities $1-p$ and $p$, for some
$p \in [0, 1]$. Under these conditions, the number, $T_{n,d}(p)$, of terminal nodes examined by the
α-β procedure can be obtained as a particular case of equation (3.11) in which $m = 1$ and
$\{p_0(k)\}$ is defined by $p_0(-1) = 1-p$, $p_0(0) = 0$, $p_0(1) = p$.

Theorem 3.2:

Let $p_0 = p$, and, for $i = 1, 2, \dots$, let $p_i = 1 - p_{i-1}^n$. Then

$T_{n,d}(p) = n^{\lceil d/2 \rceil} + n^{\lfloor d/2 \rfloor} - 1 + (P_e - 1)(P_o - 1)$ ,  (3.13)

with

$P_e = \prod_e \frac{p_{i+1}}{1 - p_i}$  and  $P_o = \prod_o \frac{p_{i+1}}{1 - p_i}$ ,

where the products $\prod_e$ and $\prod_o$ are defined as before.

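Proof:

Choose $m = 1$ and define the probability distribution $\{p_0(k)\}$ by $p_0(-1) = 1-p$,
$p_0(0) = 0$ and $p_0(1) = p$. Hence $\mathcal{P}_0(-2) = 0$, $\mathcal{P}_0(-1) = \mathcal{P}_0(0) = p = p_0$ and $\mathcal{P}_0(1) = 1$. By
equation (3.2) we obtain:

$\mathcal{P}_i(-2) = 0$ ,  $\mathcal{P}_i(-1) = \mathcal{P}_i(0) = p_i$ ,  $\mathcal{P}_i(1) = 1$ ,  for $i = 0, 1, \dots$

Then equation (3.13) follows directly from Theorem 3.1 and equations (3.4) and (3.5). ■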
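As an illustration, the closed form (3.13) takes only a few lines to evaluate. The sketch below is an assumption-laden transcription: each factor $p_{i+1}/(1-p_i)$ is computed as the geometric sum $1 + p_i + \dots + p_i^{n-1}$, which is the same number when $p_i < 1$ and avoids a division by zero at $p = 0$ and $p = 1$. The function name is hypothetical.

```python
def bivalued_leaves(n, d, p):
    """T_{n,d}(p) of equation (3.13); p is the probability of the value +1."""
    ps = [p]
    for _ in range(d):
        ps.append(1 - ps[-1] ** n)        # p_i = 1 - p_{i-1}^n

    def factor(i):
        # 1 + p_i + ... + p_i^(n-1), i.e. p_{i+1} / (1 - p_i) when p_i < 1
        return sum(ps[i] ** e for e in range(n))

    P_e = 1.0
    for i in range(0, d, 2):              # even i in 0..d-1
        P_e *= factor(i)
    P_o = 1.0
    for i in range(1, d, 2):              # odd i in 1..d-1
        P_o *= factor(i)
    return n ** ((d + 1) // 2) + n ** (d // 2) - 1 + (P_e - 1) * (P_o - 1)
```

For n = 2 and d = 2 the result can be checked by hand: three bottom positions are always examined, and the fourth one is examined exactly when the third holds +1 and at least one of the first two holds -1, which gives $3 + p(1 - p^2)$.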

Equation  (3.13)  can  be  evaluated  very  easily  and,  in  particular,  we  note  that  for 

0 < p < 1: 

^ • <3-1 

This last equation shows that $T_{n,d}(p)$ reaches its minimum $n^{\lceil d/2 \rceil} + n^{\lfloor d/2 \rfloor} - 1$ for $p = 0$ and
$p = 1$. This is in agreement with the result of Slagle and Dixon [56, p. 201] since it
corresponds to the case when all terminal nodes are assigned the same value and
therefore all possible cut-offs do occur. Equation (3.14) also shows that $T_{n,d}(p)$ admits a
maximum for $p \in (0, 1)$; although the exact maximum cannot be readily obtained, we will
derive a lower bound in the following. We first establish a preliminary result.

Lemma 3.5:

The unique positive root, $\xi_n$, of the equation

$x^n + x - 1 = 0$

is in the interval (0, 1). Asymptotically (for large n) it satisfies:

$\xi_n = 1 - \frac{\ln n}{n} + O\!\left( \frac{\ln \ln n}{n} \right)$ .  (3.15)

Proof: 

As there is no ambiguity, we will drop the index n from $\xi_n$ in the following.

Let $g(x) = x^n + x - 1$, and note that $g(0) = -1 < 0$ and $g(1) = 1 > 0$. Since $g$ is
continuous and strictly increases for x positive, the equation $g(x) = 0$ admits a unique
positive root, $\xi$, which is in the interval (0, 1).


We observe that equation $\xi^n + \xi - 1 = 0$ can be rewritten as

$\frac{1 - \xi}{\xi} = \frac{1}{1 + \xi + \dots + \xi^{n-1}}$ ,

from which we deduce that

$1 - \xi > \frac{1}{n + 1}$ .  (3.16)

On the other hand, since $1 - \xi = \xi^n$, we obtain

$n(\xi - 1) > n \ln \xi = \ln(1 - \xi)$ ,

which shows, along with equation (3.16), that

$1 - \xi < \frac{1}{n} \ln(n+1) = \frac{\ln n}{n} + O(n^{-2})$ .  (3.17)

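Similarly, taking the logarithm of both sides of equation (3.17), and using the facts that
$1 - \xi = \xi^n$ and that $\ln \xi > 1 - 1/\xi$, we obtain:

$\xi < \frac{1}{1 + \frac{1}{n} \ln\!\left( \frac{n}{\ln(n+1)} \right)}$ ,

hence:

$1 - \xi > \frac{1}{n} \ln\!\left( \frac{n}{\ln(n+1)} \right) + O\!\left[ \left( \frac{\ln n}{n} \right)^2 \right] = \frac{\ln n}{n} + O\!\left( \frac{\ln \ln n}{n} \right)$ .

Equation (3.15) follows directly from the previous equation and equation (3.17). ■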
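The root $\xi_n$ has no closed form for general n, but it is easily located numerically. The sketch below uses plain bisection on $g(x) = x^n + x - 1$, which is justified by the monotonicity argument of the proof; the function name and the tolerance are assumptions. It also lets one observe the two-sided bound $1/(n+1) < 1 - \xi_n < \ln(n+1)/n$ that the proof appears to establish in (3.16) and (3.17).

```python
import math


def xi(n, tol=1e-14):
    """Unique root in (0, 1) of x**n + x - 1 = 0, located by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid ** n + mid - 1 < 0:   # g is strictly increasing on [0, 1]
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# For n = 2 the root is the inverse of the golden ratio, (sqrt(5) - 1) / 2.
golden = xi(2)
```

For large n the value of `1 - xi(n)` can be compared with `math.log(n) / n` to see how slowly the leading term of the asymptotic estimate takes over.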

When $p = \xi_n$ we obtain immediately that, for $i = 0, 1, \dots$, $p_i = \xi_n$. Hence
$P_e = [\xi_n/(1-\xi_n)]^{\lceil d/2 \rceil}$ and $P_o = [\xi_n/(1-\xi_n)]^{\lfloor d/2 \rfloor}$.

From equations (3.13) and (3.15) it follows that, for large n:

$T_{n,d}(\xi_n) \sim [\, n / \ln n \,]^d$ ,  (3.18)

while equation (3.14) shows that

(3.19)

Equations (3.18) and (3.19) indicate that $T_{n,d}(p)$ can be largely influenced by the
variations of the probability distribution for the static values. This result can be easily
generalized to $N_{n,d}(m)$. In the next section, we will derive an approximation to
$N_{n,d}(m)$ which corresponds to its worst case behavior.

4 - Number of nodes explored by the α-β procedure: continuous case

In this section, we derive an approximation to $N_{n,d}(m)$ by considering the limit of
the finite series of equation (3.11) when m tends to infinity while the discrete probability
distribution $\{p_0(k)\}$ tends to a continuous probability distribution. This
corresponds to the case studied by Fuller, Gaschnig and Gillogly [23] and by Knuth and
Moore [35] when the terminal nodes of a rug tree are all assigned distinct values. In
particular, we will reestablish (with a much simpler formula) a result of [23].


4.1 - Notations and preliminary results

We first introduce the sequence of functions $\{f_i\}$ mapping the interval [0, 1] into
itself, and defined recursively by:

$f_0(x) = x$ ,

$f_i(x) = 1 - \{ 1 - [f_{i-1}(x)]^n \}^n$  for $i = 1, 2, \dots$

It is readily verified by induction on $i$ that all functions $f_i$ are strictly increasing on [0, 1]
and satisfy $f_i(0) = 0$ and $f_i(1) = 1$, i.e., 0 and 1 are two fixed points of the functions for
all n and i. The function $f_i$ will be shown to be related to the quantities $\mathcal{P}_{2i}(k)$ defined in
Section 3.1. Similarly, in relation to the quantities $\rho_{2i-2}(k)$ and $\sigma_{2i-1}(k)$, we define the
following functions on [0, 1]: for $i = 1, 2, \dots$, let

$r_i(x) = \frac{1 - [f_{i-1}(x)]^n}{1 - f_{i-1}(x)}$ ,

$s_i(x) = \frac{f_i(x)}{[f_{i-1}(x)]^n}$ .

If we define $r_i(1) = n$ and $s_i(0) = n$, we observe that all functions $r_i$ and $s_i$ are continuous
on [0, 1] (they are actually polynomials in x), and that $r_i$ is strictly increasing while $s_i$ is
strictly decreasing.

In relation to the two products $\prod_e$ and $\prod_o$, we also introduce, for $i = 1, 2, \dots$, the
following functions on [0, 1]:

$R_i(x) = r_1(x) \times \dots \times r_{\lceil i/2 \rceil}(x)$ ,

$S_i(x) = s_1(x) \times \dots \times s_{\lfloor i/2 \rfloor}(x)$ ,

where $S_1(x) = 1$. Observe here, too, that functions $R_i$ and $S_i$ are polynomials, and that,
when x increases from 0 to 1, $R_i(x)$ increases from 1 to $n^{\lceil i/2 \rceil}$ while $S_i(x)$ decreases from
$n^{\lfloor i/2 \rfloor}$ to 1.

Lastly, for $k = 0, 1, \dots, 2m+1$, let

$z_k = \mathcal{P}_0(k-m-1)$ .

Lemma 4.1:

For $i = 1, 2, \dots$ and $k = 0, \dots, 2m+1$, we have:

$r_i(z_k) = \rho_{2i-2}(k-m-1)$ ,  (4.1)

$s_i(z_k) = \sigma_{2i-1}(k-m-1)$ .  (4.2)

Proof:

We first show that for $i = 0, 1, \dots$ and $k = 0, \dots, 2m+1$:

$f_i(z_k) = \mathcal{P}_{2i}(k-m-1)$ .  (4.3)

Since $f_0(x) = x$, it follows from the definition of $z_k$ that equation (4.3) holds when $i = 0$.
Assume, for induction, that equation (4.3) holds for $i = h$. Then by equation (3.3)

$\mathcal{P}_{2h+2}(k-m-1) = 1 - \{ 1 - [f_h(z_k)]^n \}^n$ ,

which shows that equation (4.3) also holds for $i = h+1$ (from the definition of $f_{h+1}$).

Observe that $r_i(z_k) = 1 + f_{i-1}(z_k) + \dots + [f_{i-1}(z_k)]^{n-1}$; equation (4.1) follows
from equations (4.3) and (3.4). Similarly, if we note that $s_i(x)$ can be rewritten as

$s_i(x) = \frac{1 - \{ 1 - [f_{i-1}(x)]^n \}^n}{1 - \{ 1 - [f_{i-1}(x)]^n \}}$ ,

equation (4.2) follows from equations (3.2), (4.3) and (3.5). ■


4.2 - Number of bottom positions examined by the α-β procedure: continuous case

Let us return to the definition of the sequence $T_m = \{z_k\}_{0 \le k \le 2m+1}$. As
observed in Section 3.1 with the sequence $\{\mathcal{P}_0(k)\}$, the sequence is non-decreasing and
defines a partition of the interval [0, 1], i.e.:

$0 = z_0 \le z_1 \le z_2 \le \dots \le z_{2m} \le z_{2m+1} = 1$ .

The norm of the partition is

$\|T_m\| = \max\{ z_k - z_{k-1} \mid 1 \le k \le 2m+1 \} = \max\{ p_0(k) \mid -m \le k \le m \}$ .

In the remainder of the section we require the following.

Assumption:

(A1)  $\lim_{m \to \infty} \max\{ p_0(k) \mid -m \le k \le m \} = 0$ . ■

This assumption ensures that the norm of the partition $T_m$ tends to 0 when m tends
to infinity. It also shows that, as m tends to infinity, the probability of two terminal
nodes being assigned the same value vanishes. This corresponds to the case studied by
Fuller, Gaschnig and Gillogly [23], and by Knuth and Moore [35].

With this assumption, we will now see that the finite series of equation (3.11) can
be replaced by an integral when $m \to \infty$. This is established in the following.

Theorem 4.1:

Under assumption (A1), we have:

$\lim_{m \to \infty} N_{n,d}(m) = n^{\lfloor d/2 \rfloor} + \int_0^1 R_d'(t) \, S_d(t) \, dt$ ,  (4.4)

where $R_d'(x)$ is the first derivative of $R_d(x)$.

Proof:

Since there is no risk of confusion, we will drop, in the following, the index d from
the functions $R_d$ and $S_d$.

It follows directly from Lemma 4.1 that for $k = 0, \dots, 2m+1$:

$R(z_k) = \prod_e \rho_i(k-m-1)$ ,

$S(z_k) = \prod_o \sigma_i(k-m-1)$ ,

which shows that equation (3.11) can be simply rewritten as:

$N_{n,d}(m) = n^{\lfloor d/2 \rfloor} + \sum_{1 \le k \le 2m+1} [R(z_k) - R(z_{k-1})] \, S(z_k)$ .

Consider the series in this last equation.

Recall that $R(x)$ is a polynomial. By considering the Taylor development of $R$
we obtain for $k = 1, \dots, 2m+1$:

$R(z_k) - R(z_{k-1}) = [z_k - z_{k-1}] \, R'(z_{k-1}) + \frac{1}{2} [z_k - z_{k-1}]^2 \, R''(\zeta_k)$ ,

where $z_{k-1} \le \zeta_k \le z_k$, so that

$\sum_k [R(z_k) - R(z_{k-1})] \, S(z_k) = \sum_k [z_k - z_{k-1}] \, R'(z_{k-1}) \, S(z_k) + \frac{1}{2} \sum_k [z_k - z_{k-1}]^2 \, R''(\zeta_k) \, S(z_k)$ .  (4.5)

Since R and S are polynomials, the quantity $|R''(x) S(y)/2|$ is bounded by some constant,
say M, for any x and y in [0, 1]. In particular, the second sum in equation (4.5) is bounded
in modulus by $M \cdot \|T_m\| \cdot [z_{2m+1} - z_0] = M \cdot \|T_m\|$ and therefore tends to 0 when $m \to \infty$ since,
from assumption (A1), $\|T_m\| \to 0$.

As for the first sum in equation (4.5), we observe that it corresponds to a Riemann
sum for the function $R'(x) S(x)$ over the partition $T_m$ of [0, 1]. Therefore since, in
particular, this function is continuous and since $\|T_m\|$ tends to 0, the sum tends to the
integral of equation (4.4). This proves the theorem. ■

In the remainder of the section we will reinterpret the limit of $N_{n,d}(m)$ established
in Theorem 4.1.

Let G be the distribution function of some continuous probability density function g,
and assume, to simplify the discussion, that $G(-1) = 0$ and $G(1) = 1$ (therefore, $G(x) = 0$ for
$x \le -1$ and $G(x) = 1$ for $x \ge 1$). We define a sequence of functions $\{G_m\}$ for $m = 1, 2, \dots$ as
follows. For $-m \le k \le m$, let $x_k = k/m$. Function $G_m$ is defined as the following step
function:

$G_m(x) = 0$  if $x < x_{-m}$ ,

$G_m(x) = G(x_k)$  if $x_k \le x < x_{k+1}$, for $-m \le k \le m-1$ ,

$G_m(x) = 1$  if $x_m \le x$ .

The sequence of functions $\{G_m\}$ constitutes a sequence of approximations to the
continuous function G. (It should be noted that the convergence of the sequence is
uniform on the interval [-1, 1].) The function $G_m$ corresponds to the cumulative distribution
of the discrete probability distribution $p_0(k) = G_m(x_k) - G_m(x_{k-1})$ associated with the
points $x_k = k/m$, for $k = -m, \dots, m$.


Using the approximation $\{p_0(k)\}_{-m \le k \le m}$ to the density function g, equation (3.11)
provides us with an approximation to the average number of bottom positions examined by
the α-β procedure in a rug tree in which the bottom values are drawn from the continuous
probability density function g. When m becomes larger, the approximation becomes
better, and (due to the uniform convergence of the sequence $\{G_m\}$) it can actually be shown
(in a rather technical way) that the limit of $N_{n,d}(m)$ when $m \to \infty$ corresponds exactly to
the average number of bottom positions examined by the α-β procedure in the continuous
case. As a matter of fact, equation (4.4) could be derived directly by considering a
continuous probability distribution rather than a discrete one in very much the same way
we derived equation (3.11) in Section 3. This result is stated in the following.


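Theorem 4.2:

Let $f_0(x) = x$, and, for $i = 1, 2, \dots$, define:

$f_i(x) = 1 - \{ 1 - [f_{i-1}(x)]^n \}^n$ ,

$r_i(x) = \frac{1 - [f_{i-1}(x)]^n}{1 - f_{i-1}(x)}$ ,

$s_i(x) = \frac{f_i(x)}{[f_{i-1}(x)]^n}$ ,

$R_i(x) = r_1(x) \times \dots \times r_{\lceil i/2 \rceil}(x)$ ,

$S_i(x) = s_1(x) \times \dots \times s_{\lfloor i/2 \rfloor}(x)$ .

The average number, $N_{n,d}$, of terminal nodes examined by the α-β pruning algorithm in
a rug tree of degree n and depth d for which the bottom values are drawn from a
continuous distribution is given by:

$N_{n,d} = n^{\lfloor d/2 \rfloor} + \int_0^1 R_d'(t) \, S_d(t) \, dt$ .  (4.6)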


It is to be noted that, unlike the case of a discrete probability distribution, when
the bottom values are drawn from a continuous distribution, the number of terminal
positions examined by the α-β procedure does not depend on the distribution function.
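The integral of equation (4.6) is straightforward to approximate numerically. The sketch below is only one way of doing so and rests on several assumptions: $r_i$ and $s_i$ are evaluated through their geometric-sum forms (which coincide with the quotients of the theorem and remain defined at the end points), the integral is read as a Stieltjes integral of $S_d$ against $R_d$, and a trapezoidal rule with a hypothetical default of 20000 steps is used.

```python
def continuous_leaves(n, d, steps=20000):
    """Numerical estimate of N_{n,d} from equation (4.6)."""

    def R_and_S(x):
        f = x                                          # f_0(x)
        R, S = 1.0, 1.0
        for i in range(1, (d + 1) // 2 + 1):
            q = 1 - f ** n
            R *= sum(f ** e for e in range(n))         # r_i(x) = 1 + f + ... + f^(n-1)
            if i <= d // 2:
                S *= sum(q ** e for e in range(n))     # s_i(x) = 1 + q + ... + q^(n-1)
            f = 1 - q ** n                             # f_i(x)
        return R, S

    total = float(n ** (d // 2))
    R_prev, S_prev = R_and_S(0.0)
    for k in range(1, steps + 1):
        R_cur, S_cur = R_and_S(k / steps)
        total += (R_cur - R_prev) * (S_prev + S_cur) / 2   # trapezoid for the integral of S dR
        R_prev, S_prev = R_cur, S_cur
    return total
```

For n = 2 and d = 2 the value can be confirmed directly: three bottom positions are always examined, and the fourth is skipped only when the third holds the smallest of the first three values, so the average is 3 + 2/3.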

4.3 - Discrete case versus continuous case

Since equation (4.6) has been derived as the limit of equation (3.11), it is reasonable
to investigate the validity of the approximation of $N_{n,d}(m)$ by $N_{n,d}$. As was seen in
Section 3.3, $N_{n,d}(m)$ strongly depends on the probability distribution $\{p_0(k)\}$;
therefore, we cannot expect $N_{n,d}$ to be a close approximation of $N_{n,d}(m)$ in all cases. We
will see below, however, that $N_{n,d}$ provides us with a good insight into the behavior of
the α-β pruning algorithm. Namely, we will see that it constitutes the worst case of
discrete probability distributions.

Since $N_{n,d}$ was obtained as the limit of $N_{n,d}(m)$, it is sufficient to show that, for all
probability distributions $\{p_0(k)\}$, we have:

$N_{n,d}(m) \le N_{n,d}$ .  (4.7)

In order to prove inequality (4.7), it is convenient to give a geometric interpretation of
both $N_{n,d}$ and $N_{n,d}(m)$.

Consider the curve (C) defined by the Cartesian coordinates (x, y) through the
parametric equations

(C):  $x = R_d(t)$ ,  $y = S_d(t)$ ,

where the parameter t varies in the interval [0, 1]. The integral of equation (4.6)
represents the area delimited by the curve (C), the x-axis and the parallels to the y-axis
at the abscissas $R_d(0) = 1$ and $R_d(1) = n^{\lceil d/2 \rceil}$ (see Figure 4.1). Since $R_d(0) = 1$ and
$S_d(0) = n^{\lfloor d/2 \rfloor}$, the term $n^{\lfloor d/2 \rfloor}$ of equation (4.6) can be accounted for by the area of the
rectangle delimited by the x-axis, the y-axis and the lines $x = 1$ and $y = n^{\lfloor d/2 \rfloor}$ (the latter
line extends the curve (C) in a continuous way). Figure 4.1 represents the curve (C) and
its extension in the case n = 3, d = 6. The area below the unbroken lines represents the
quantity $N_{n,d}$.

The sum of equation (3.11) can also be represented along with the curve (C). It
follows directly from the relations of equations (4.1) and (4.2) that the terms of the sum
represent the areas of the rectangles delimited by the lines $x = R(z_{k-1})$, $x = R(z_k)$, $y = 0$
and $y = S(z_k)$, for $k = 1, 2, \dots, 2m+1$. The quantity $N_{n,d}(m)$ represents therefore the area of
Figure 4.1 shown below the broken lines.

Figure 4.1 - Geometric interpretation of $N_{n,d}$ and $N_{n,d}(m)$

Inequality (4.7), then, follows directly from the fact that, when t increases in [0, 1],
R(t) increases while S(t) decreases.
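The rectangle picture can be reproduced numerically. The sketch below assumes the reading in which the series of (3.11) is a lower Riemann-Stieltjes sum, $n^{\lfloor d/2 \rfloor} + \sum_k [R(z_k) - R(z_{k-1})] S(z_k)$, over a partition $z_0 = 0 \le \dots \le z_K = 1$; function names and the choice of uniform partitions are hypothetical.

```python
def R_and_S(n, d, x):
    """Return (R_d(x), S_d(x)), using the geometric-sum forms of r_i and s_i."""
    f = x
    R, S = 1.0, 1.0
    for i in range(1, (d + 1) // 2 + 1):
        q = 1 - f ** n
        R *= sum(f ** e for e in range(n))        # r_i(x)
        if i <= d // 2:
            S *= sum(q ** e for e in range(n))    # s_i(x)
        f = 1 - q ** n                            # f_i(x)
    return R, S


def lower_sum(n, d, z):
    """Area of the rectangles below the curve (C) for the partition z."""
    total = float(n ** (d // 2))
    R_prev, _ = R_and_S(n, d, z[0])
    for point in z[1:]:
        R_cur, S_cur = R_and_S(n, d, point)
        total += (R_cur - R_prev) * S_cur         # width times height at the right end
        R_prev = R_cur
    return total


def uniform_partition(parts):
    return [k / parts for k in range(parts + 1)]


# Successive refinements for n = 2, d = 2; the continuous value is 11/3.
sums = [lower_sum(2, 2, uniform_partition(2 ** j)) for j in range(8)]
```

Because each partition refines the previous one, the sums can only grow, and they stay below the area under the curve, which is the content of inequality (4.7).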

5 - On the branching factor of the α-β pruning algorithm

We have deliberately chosen to introduce first the case when the bottom values of a
game tree are drawn from a discrete probability distribution since it is of most interest in
practical applications. The case of a continuous distribution, however, lends itself more
easily to an analysis, and, since it constitutes the worst case over all discrete probability
distributions, we will, in this section, examine the integral of equation (4.6) rather than
the series of equation (3.11).

5.1 - Previous results


In Section 1, we introduced the branching factor as a cost measure for the work
involved in searching a tree. Rather than considering the number, $N_{n,d}$, of terminal
positions examined by a search algorithm, as a measure of performance of the algorithm,
we  could  have  considered  the  total  number,  of  nodes  (terminal  and  Internal)  explored 
during  the  search.  In  the  case  of  the  u-fi  pruning  algorithm,  since  given  by 

equation  (4.6),  does  not  depend  on  the  distribution  function  of  the  bottom  values,  we 
deduce  that  satisfies: 

~ ^ * ^n.l  * 

It  can  be  checked  easily  that  0 s ^ ^n,i‘  iberefore  j s.  j i ^^n,d' 

Thus, Definition 1.1 provides us with a measure of performance useful to compare search
algorithms. In the following, we review some of the results which have already been
presented in the literature.

Minimax  search 

The minimax search examines systematically all nodes of a tree. It, therefore,
examines $n^d$ terminal nodes in a uniform tree of degree n and depth d, leading to a
branching factor

$\mathcal{R}_{\mathrm{minimax}}(n) = n$ .

α-β procedure under optimal ordering

Slagle and Dixon [56, p. 201] have shown that, when all possible α- and β-cut-offs
occur, the α-β procedure examines

$N_{n,d} = n^{\lceil d/2 \rceil} + n^{\lfloor d/2 \rfloor} - 1$

terminal positions. In this case, the corresponding branching factor is $n^{1/2}$.

α-β procedure (experimental results from [23])

Based on a series of simulation results, Fuller, Gaschnig and Gillogly [23] have
argued that the formula

$N_{n,d} \approx c(d) \cdot n^{0.72d + 0.277}$

constitutes a reasonable approximation to the number of bottom positions examined by the
α-β procedure for small values of n and d, and that $1 \le c(d) \le 2$ (at least for the range of
values they considered). For purposes of comparison, let us assume that their
approximation can be extrapolated for any n and d. Provided that $c(d)^{1/d} \to 1$ when $d \to \infty$,
we obtain a branching factor of $n^{0.72}$.

In  view  of  the  results  of  Section  3.3,  we  can  question  the  accuracy  of  the  approximation 
for  large  n since  it  follows  from  Theorem  3.2  that 

■ O(nArxn). 

α-β procedure without deep cut-offs

Knuth and Moore [35] have analyzed a simpler version of the α-β procedure by not
considering the possibilities of deep cut-offs. This β-procedure is the same as the
α-β procedure except that no α-values are passed to the procedure; instead, the
lower value α is always set to $-\infty$ before exploring the successors of a node. Knuth and
Moore have shown that the branching factor of this procedure satisfies

$\mathcal{R}_{\beta}(n) = \Theta(n / \ln n)$ .

Note that, since the β-procedure always explores more nodes at any depth in a tree than
the full α-β procedure does in the same tree, $\mathcal{R}_{\beta}(n)$ provides us with an upper bound for
the branching factor of the α-β procedure.

5.2 - Bounds on the branching factor of the α-β procedure

In this section we will derive some lower and upper bounds on the branching factor
of the α-β pruning algorithm. In particular, since the lower bound we derive grows with n
as n/ln n, we will be able to conclude, using the result on the branching factor of the
α-β procedure without deep cut-offs established by Knuth and Moore in [35], that the
branching factor of the α-β procedure is $\Theta(n/\ln n)$.

We introduced in Section 4.1 the sequence of functions $f_i$, i = 0, 1, ..., from [0, 1] to
itself, and we observed that all functions share the two fixed points 0 and 1
(independent of n). Another common fixed point, which depends on n, was introduced in
Section 3.3.

Lemma 5.1:

For a given n, all functions $f_i$, for i = 0, 1, ..., share the common fixed point
$\xi_n$, unique positive root of the equation

$$x^n + x - 1 = 0 .$$

Proof:

For clarity, we will drop the index n from $\xi_n$ in the following.

Since $f_0(x) = x$, ξ is certainly a fixed point of $f_0$; assume, for induction, that
$f_{i-1}(\xi) = \xi$, then from the definition of $f_i$ we have

$$f_i(\xi) = 1 - (1 - \xi^n)^n = 1 - \xi^n = \xi ,$$

which shows that ξ is a fixed point common to all functions $f_i$, i = 0, 1, ... ■
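The fixed point of Lemma 5.1 is easy to check numerically. The sketch below finds the positive root of $x^n + x - 1 = 0$ by bisection and evaluates the ratio $\xi_n/(1-\xi_n)$, which is the lower bound tabulated in Table 5.1. The recurrence used for the functions $f_i$, namely $f_0(x) = x$ and $f_i(x) = 1 - (1 - f_{i-1}(x)^n)^n$, is an assumption here, since Section 4.1 is not reproduced in this part of the text; the function names are hypothetical.

```python
def xi(n: int, tol: float = 1e-12) -> float:
    """Positive root of x**n + x - 1 = 0, located by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mid ** n + mid - 1.0 < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def f(i: int, x: float, n: int) -> float:
    """Assumed recurrence: f_0(x) = x, f_i(x) = 1 - (1 - f_{i-1}(x)**n)**n."""
    for _ in range(i):
        x = 1.0 - (1.0 - x ** n) ** n
    return x


def lower_bound(n: int) -> float:
    """xi_n / (1 - xi_n), the lower bound on the branching factor."""
    root = xi(n)
    return root / (1.0 - root)


if __name__ == "__main__":
    for degree in (2, 3, 4):
        print(degree, round(xi(degree), 6), round(lower_bound(degree), 3))
```

For n = 2 the root is the golden-ratio conjugate 0.618..., giving a lower bound of 1.618.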


Since $\xi_n$ is a fixed point common to all functions $f_i$, i = 0, 1, ..., it is easy to evaluate
at this point the functions $r_i$ and $s_i$ defined in Section 4.1. For i = 1, 2, ..., we deduce that:

$$r_i(\xi_n) = s_i(\xi_n) = \xi_n/(1 - \xi_n) . \qquad (5.1)$$

In particular, it follows from Lemma 3.5 that, for large n:

$$r_i(\xi_n) = s_i(\xi_n) \sim n/\ln n . \qquad (5.2)$$

Equations (5.1) and (5.2) will be useful to obtain the desired bounds in the remainder of
the section.

The geometric representation of equation (4.6), given in Figure 4.1, makes it easy to
derive bounds on this quantity. They are stated in the following.

Theorem 5.1:

The branching factor of the α-β pruning algorithm in the search of a rug tree of
degree n satisfies:

$$n/\ln n \sim \xi_n/(1-\xi_n) \le R_{\alpha\beta}(n) \le [\, n\,\xi_n/(1-\xi_n) \,]^{1/2} \sim n/\sqrt{\ln n} , \qquad (5.3)$$

for n = 2, 3, ...

Proof: 

Since, when t increases in [0, 1], $R_d(t)$ increases while $S_d(t)$ decreases, it follows
directly that for any α in [0, 1] we have the following inequalities:

< R^(k)S^{0)  * [R^d)  - R^(oi)].S^(oi)  . (5.4) 

If  we  choose  a » we  have  R^(oi)  » and  S^(oi)  ■ []"„/(/ Since 

Rjd)  • and  Inequality  (5.3)  follows  Immediately  from  inequality  (5.4) 

and  the  results  of  Lemma  3.5.  B 

As  an  immediate  consequence,  we  obtain  the  following. 

Theorem 5.2:

The branching factor of the α-β pruning algorithm in the search of a rug tree of
degree n satisfies, for large n:

$$R_{\alpha\beta}(n) = \Theta(n/\ln n) .$$

Proof:

The result comes directly from the lower bound $\xi_n/(1-\xi_n) \sim n/\ln n$ of Theorem 5.1,
and from the upper bound $R_{\beta}(n)$ obtained for the α-β procedure without deep cut-offs,
which Knuth and Moore have shown to be $\Theta(n/\ln n)$. ■

This result confirms, as was suggested by Knuth and Moore [35, p. 310], that deep
cut-offs have only a second order effect on the behavior of the α-β pruning algorithm. On
the other hand, it shows that the formula proposed by Fuller, Gaschnig and Gillogly in [23]
and mentioned in Section 5.1, if it constitutes a reasonable approximation for small values
of n and d (the range of values they considered is n * d ≤ 12), is certainly not adequate for
large values.


We note that the bounds of Theorem 5.1 were obtained without difficulty by
conveniently choosing just one point, $\xi_n$, on the curve, since it was easy to evaluate
both $R_d$ and $S_d$ at this point. In the next section, using a different approach, we will derive a
tighter upper bound on the branching factor.

5.3 - Improved upper bound

Since, for d = 1, 2, ..., $N_{n,d} \le N_{n,d+1} \le n\,N_{n,d}$, if $N_{n,d}^{1/d}$ tends to some limit
when d tends to infinity as an even integer, this quantity tends to the same limit when d
tends to infinity as an odd integer. Therefore, without loss of generality, we will only
consider, in this section, the case when d is an even integer. Let d = 2h.

For x in [0, 1] and for i = 1, 2, ..., we define $p_i(x) = r_i(x)\,s_i(x)$.

Lemma 5.2:

All functions $p_i$, for i = 1, 2, ..., have the same absolute maximum, $M_n$, in the
interval [0, 1].

Proof:

From the definitions of $r_i(x)$ and $s_i(x)$ we have for i = 1, 2, ...:

$$r_i(x) = r_1[f_{i-1}(x)] ,$$

and

$$s_i(x) = s_1[f_{i-1}(x)] .$$

Therefore, for i = 1, 2, ..., we also have, from the definition of $p_i(x)$:

$$p_i(x) = p_1[f_{i-1}(x)] .$$

The lemma follows by observing that, for i = 1, 2, ..., $f_{i-1}$ is a one-to-one function from
[0, 1] to itself. ■
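Lemma 5.2 can be checked numerically with a simple grid search. The sketch below assumes the definitions $r_1(x) = (1-x^n)/(1-x)$, $s_1(x) = [1-(1-x^n)^n]/x^n$ and $f_1(x) = 1-(1-x^n)^n$; these belong to Section 4.1, which is not reproduced here, so they should be read as an assumption. They are at least consistent with the text: $r_1(0) = 1$, $r_1(1) = n$, $s_1 \le n$, and the square root of the resulting maximum reproduces the values 1.622 (n = 2) and 2.168 (n = 3) listed among the upper bounds of Table 5.1. All function names are hypothetical.

```python
def geometric_sum(q: float, n: int) -> float:
    """1 + q + ... + q**(n-1), i.e. (1 - q**n)/(1 - q) without the division."""
    return sum(q ** k for k in range(n))


def f1(x: float, n: int) -> float:
    """Assumed f_1(x) = 1 - (1 - x**n)**n."""
    return 1.0 - (1.0 - x ** n) ** n


def p1(x: float, n: int) -> float:
    """Assumed p_1(x) = r_1(x) * s_1(x), both factors written as finite sums."""
    r1 = geometric_sum(x, n)             # (1 - x**n)/(1 - x)
    s1 = geometric_sum(1.0 - x ** n, n)  # (1 - (1 - x**n)**n)/x**n
    return r1 * s1


def max_on_grid(func, points: int = 20001) -> float:
    """Maximum of func over a uniform grid of [0, 1]."""
    return max(func(i / (points - 1)) for i in range(points))


def m_n(n: int) -> float:
    """Grid estimate of M_n, the maximum of p_1 over [0, 1]."""
    return max_on_grid(lambda x: p1(x, n))


if __name__ == "__main__":
    for degree in (2, 3):
        same_max = max_on_grid(lambda x: p1(f1(x, degree), degree))  # p_2 = p_1 o f_1
        print(degree, round(m_n(degree) ** 0.5, 3), round(same_max ** 0.5, 3))
```

The second printed column is the maximum of $p_2 = p_1 \circ f_1$, which coincides with that of $p_1$ as the lemma states.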


Lemma 5.2 shows that, in order to study the maximum of $p_i(x)$, when x ∈ [0, 1], it is
sufficient to study the maximum of the polynomial $p_1(x)$.

Observe that $M_n \ge p_1(\xi_n) = [\xi_n/(1-\xi_n)]^2$. In particular, since it can be checked easily
that, for n = 2, 3, ..., $\xi_n > \sqrt{n}/(1+\sqrt{n})$, it follows that

$$M_n > n \quad \text{for } n = 2, 3, ... \qquad (5.5)$$


Theorem 5.3:

The branching factor of the α-β pruning algorithm for a rug tree of degree n
satisfies:

$$R_{\alpha\beta}(n) \le M_n^{1/2} , \qquad (5.6)$$

where $M_n$ is defined in Lemma 5.2.

Proof: 

From  the  definition  of  '*'®  o^’^ain  for  h • 7,  3, ...: 

R2h(t)  - ' 

By  multiplication  by  it  follows  that 

/?2/,<'t).52/i^*^  - f^2h-2^f^-^2h-2^^^-Ph^^^  * ^2/i-2^‘'^-'^2/i-2*^^^-''/i^‘^-*A^*^  • 

Since, for t ∈ [0, 1], all factors in this equation are non-negative, we deduce, using the
results of Lemma 5.2 and the fact that $s_h(t) \le n$ when t ∈ [0, 1], that:

Since,  in  addition, 

R2<t)  S2(t)  - r](t)sj(t)  S nr'j(t), 
it follows that for t ∈ [0, 1] and h = 1, 2, ...:

R2f,(t)  S2h(t>  ^ [r;(t)  + ...  ♦ r'f,(t)] . (5.7) 

Let $I_{n,d}$ be the integral defined in equation (4.6). By integrating inequality (5.7) over
[0, 1] we see that $I_{n,2h}$ satisfies:

I^  2h  ^ [Mn-i)]  - n(n-l)  h 

since $r_i(0) = 1$ and $r_i(1) = n$ for i = 1, 2, ... This shows that

^rt,2A  ^ • 

Equation  (5.6)  now  follows  directly  from  inequality  (5.5).  I 
5.4 - Numerical results


Table 5.1 summarizes the results of this section. It presents the various lower and
upper bounds we have derived for the branching factor of the α-β pruning algorithm from
equations (5.3) and (5.6).


        lower bound     upper bounds
   n    ξ_n/(1-ξ_n)     M_n^(1/2)    [n ξ_n/(1-ξ_n)]^(1/2)    from [35]

   2       1.618          1.622            1.799                1.884
   3       2.148          2.168            2.538                2.666
   4       2.630          2.678            3.243                3.397
   5       3.080          3.166            3.924                4.095
   6       3.506          3.638            4.587                4.767
   7       3.915          4.098            5.235                5.421
   8       4.309          4.549            5.872                6.059
   9       4.692          4.993            6.498                6.684
  10       5.064          5.430            7.116                7.298
  11       5.427          5.862            7.726                7.902
  12       5.782          6.290            8.330                8.498
  13       6.130          6.713            8.927                9.086
  14       6.473          7.133            9.519                9.668
  15       6.809          7.549           10.107               10.243
  16       7.141          7.963           10.689               10.813
  17       7.468          8.373           11.268               11.378
  18       7.791          8.782           11.842               11.938
  19       8.110          9.188           12.413               12.494
  20       8.425          9.591           12.980               13.045
  21       8.736          9.993           13.545               13.593
  22       9.045         10.393           14.106               14.137
  23       9.350         10.791           14.665               14.678
  24       9.653         11.188           15.221               15.215
  25       9.952         11.583           15.774               15.748
  26      10.250         11.976           16.325               16.265
  27      10.545         12.369           16.873               16.770
  28      10.838         12.759           17.420               17.288
  29      11.128         13.149           17.964               17.796
  30      11.416         13.537           18.507               18.300
  31      11.703         13.924           19.047               18.802
  32      11.987         14.310           19.586

Table 5.1 - Bounds on the branching factor of the α-β pruning algorithm

Although we have not been able to give an estimate for the asymptotic growth of
$M_n$, we can easily derive an upper bound for this quantity by studying rug trees of depth
2 since:
which shows that a 0(n/An n). The numerical results of Table 5.1 indicate that
is a much better upper bound for than for the range of values we
have considered.


6 - A parallel α-β pruning algorithm

When several processes are available, a solution that comes naturally to mind for
implementing the α-β pruning algorithm is to have each process explore in parallel a
different subtree of the entire game tree. Each subtree would be explored using the
α-β procedure to back up its value to its root, say some node P; the value would then be
reported to the father of node P in order to decide if the remaining brothers of node P
can be pruned.

A possible implementation for this solution is to have the parallel algorithm
organized around a static decomposition of the game tree, for example, by generating first
all nodes at, say, depth 1 or depth 2 before starting all processes in parallel. As is shown
in [37], however, static decomposition is not well adapted for execution on an
asynchronous multiprocessor; this is especially true when processes have different speeds
and the various subtasks have different sizes.

A dynamic decomposition of the game tree, on the other hand, is better suited for
the processes to adjust their loads according to their own speeds. We immediately
observe, however, that a dynamic implementation will require a global data structure for
the processes to communicate among themselves. Since this data structure has to be
updated by more than one process in parallel, synchronization will almost necessarily be
required to preserve the validity of the structure at any time; in consequence, this will
create a large (and unwanted) overhead.
Most important is that, by exploring in parallel and independently different subtrees
of the game tree, we lose the power of the α-β pruning algorithm. By looking back at
the original algorithm, we observe that its efficiency is mainly achieved by the fact that,
at any point during the search, the decision of pruning branches is based upon all the
information previously acquired during the search. Obviously, when different subtrees are
explored independently in parallel rather than sequentially, less information is available
to each process, and, consequently, overall more nodes have to be explored. As
will be seen, the parallel algorithm we propose below for the α-β pruning does not suffer
from the loss of information communicated between the various processes.

6.1 - A parallel implementation for the α-β pruning algorithm

While proving the correctness of the ALPHABETA procedure, Knuth and Moore [35]
have established equations (2.2), (2.3) and (2.4) mentioned in Section 2. We now
reinterpret these equations. Let V = ALPHABETA(P, α, β), and let $V_0$ = MINIMAX(P). It
follows directly from equations (2.2), (2.3) and (2.4) that when α < β:

$$\text{if } V \le \alpha \text{ then } V_0 \le \alpha , \qquad (6.1)$$

$$\text{if } \alpha < V < \beta \text{ then } V_0 = V , \qquad (6.2)$$

$$\text{if } V \ge \beta \text{ then } V_0 \ge \beta . \qquad (6.3)$$
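The three cases (6.1) to (6.3) can be observed on a tiny example. The sketch below is a negamax formulation of alpha-beta in the style of Knuth and Moore, written for illustration only: the tree encoding (nested lists whose integer leaves are scored for the side to move) and the function name are assumptions of this example, not the ALPHABETA procedure of Section 2 itself.

```python
def alphabeta(node, alpha, beta):
    """Negamax alpha-beta; a leaf is an int scored for the side to move."""
    if isinstance(node, int):
        return node
    for child in node:
        t = -alphabeta(child, -beta, -alpha)
        if t > alpha:
            alpha = t
        if alpha >= beta:       # cut-off: remaining brothers are pruned
            break
    return alpha


if __name__ == "__main__":
    tree = [[3, 5], [2, 9]]     # minimax value V0 = 3
    for a, b in ((5, 10), (0, 10), (-10, 2)):
        v = alphabeta(tree, a, b)
        if v <= a:
            print(f"({a}, {b}): V = {v} <= alpha, so V0 <= {a}")      # case (6.1)
        elif v >= b:
            print(f"({a}, {b}): V = {v} >= beta, so V0 >= {b}")       # case (6.3)
        else:
            print(f"({a}, {b}): alpha < V < beta, so V0 = {v}")       # case (6.2)
```

Only the window that contains the true value returns it exactly; the other two windows merely bound it from one side.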

The value $V_0$ (and the path in the game tree associated with that value) is the solution we
are seeking when the node P is the root of the game tree. Equations (6.1) to (6.3) suggest
that the problem of finding the solution $V_0$ can be viewed as the problem of locating the
root of a monotonic function over some interval using only asynchronous parallel
evaluation of the function. (This root finding problem has been studied by Hyafil and
Kung, see [37] and [44].) Several differences are, however, immediately noticeable. In
the root finding problem we are only looking for an approximation to the root and each
evaluation of the function takes place at a single point. In the game tree searching
problem, on the other hand, we are interested in the exact solution and each intermediate
search, or partial search, executed through the call ALPHABETA(P, α, β), examines an open
interval: (α, β). Equation (6.2) shows that, provided the exact value lies in this open
interval, the call returns the exact solution, and this terminates the entire search. The
following program gives a parallel implementation of the α-β pruning algorithm based on
this decomposition.


Program A:

global integer CALPHA, CBETA;

Initialization:
begin
    CALPHA := -∞; CBETA := +∞;
    start processes P_1, ..., P_k
end

Process P_j:
begin
    integer A_j, B_j, V_j;
    {(A_j, B_j) := SELECTNEWINTERVAL};
    while A_j < B_j do
    begin
        V_j := AB(Root, A_j, B_j, true);
        if V_j ≤ A_j then
        begin
            {CBETA := min(CBETA, A_j+1);                              (6.4)
             (A_j, B_j) := SELECTNEWINTERVAL}
        end
        else
        if V_j ≥ B_j then
        begin
            {CALPHA := max(CALPHA, B_j-1);                            (6.5)
             (A_j, B_j) := SELECTNEWINTERVAL}
        end
        else
        begin
            {CALPHA := CBETA := V_j};                                 (6.6)
            return the solution: V_j;
            terminate
        end
    end;
    terminate
end



The two global variables CALPHA and CBETA define the current open interval
known to contain the solution $V_0$. (When this solution is found, however, both CALPHA and
CBETA are set to $V_0$.) The interval (CALPHA, CBETA) is initialized to (-∞, +∞) and is
updated each time a process finishes a partial search over the game tree. The procedure
SELECTNEWINTERVAL uses, without modifying them, the variables CALPHA and CBETA (as
well as $A_1$, ..., $A_k$ and $B_1$, ..., $B_k$) to determine a new interval ($A_j$, $B_j$) over which process
$P_j$ will proceed to a new partial search. This procedure is critical to the efficiency of
Program A and will be discussed in more detail in Section 7. For the time being, we will
only assume that it meets the following specifications. Given the variables CALPHA and
CBETA (and the variables $A_1$, ..., $A_k$ and $B_1$, ..., $B_k$), let (A, B) := SELECTNEWINTERVAL:

(a) A = B if CALPHA = CBETA;

(b) A < B otherwise.

As we are only dealing with integers, condition (b) is equivalent to the condition A ≤ B-1.

Since the two global variables CALPHA and CBETA are updated in parallel by
several processes, their use is restricted to critical sections (indicated in Program A
with curly brackets); the use of the procedure SELECTNEWINTERVAL also occurs within a
critical section.

Theorem 6.1:

At any time in the execution of Program A (outside a critical section), the
solution $V_0$ satisfies either one of the following two conditions:

$$CALPHA < V_0 < CBETA , \qquad (6.7)$$

$$CALPHA = V_0 = CBETA . \qquad (6.8)$$

Proof:

After initialization, at time $t_0$, the variables CALPHA and CBETA are only modified
(in a critical section) through one of the instructions (6.4), (6.5) or (6.6) executed at the
time instants $t_1$, $t_2$, ... (with $t_{i-1} < t_i$ for i ≥ 2). After $t_0$, CALPHA = -∞ and
CBETA = +∞, therefore condition (6.7) is certainly satisfied. Assume that after $t_{i-1}$, for
i ≥ 1, condition (6.7) or (6.8) is satisfied. If instruction (6.6) is executed at time $t_i$ by
process $P_j$, it follows from equation (6.2) that $V_j = V_0$, therefore condition (6.8) is satisfied
after $t_i$. If instruction (6.4) is executed at time $t_i$ by process $P_j$, it follows from
equation (6.1) that $V_0 \le A_j$, or equivalently $V_0 < A_j + 1$ (recall that both $V_0$ and $A_j$ are
integers). If, prior to $t_i$, condition (6.7) were satisfied, then $V_0$ < CBETA, which shows that
$V_0$ < min(CBETA, $A_j$+1) and condition (6.7) remains satisfied after $t_i$; if, prior to $t_i$,
condition (6.8) were satisfied, then CBETA = $V_0$ < $A_j$+1, which shows that
min(CBETA, $A_j$+1) = CBETA and condition (6.8) remains satisfied. The same holds when
instruction (6.5) is executed. ■

Theorem 6.1, along with the specifications (a) and (b) of the procedure
SELECTNEWINTERVAL, proves the correctness of Program A in the sense that if the
program terminates it generates the correct solution.

Proving the termination of Program A, on the other hand, requires additional
specification of the procedure SELECTNEWINTERVAL. Observe, for example, that, if we
always have $A_j = B_j - 1$, the open interval ($A_j$, $B_j$) does not contain any integer ($A_j$ and $B_j$
are integers themselves) and no solution can ever be found. If, however, we replace
condition (b) above by:

(b') A ≤ B-2 otherwise,

it can be shown easily that the length of the interval (CALPHA, CBETA) decreases at least
by 1 each time a process completes a partial search. Since in a practical implementation
the interval (-∞, +∞) is actually a finite interval in which we know that the solution $V_0$ is
to be found, we are guaranteed of the termination of Program A under condition (b').
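To make the control flow of Program A concrete, here is a minimal threaded sketch in Python. It is an illustration under stated assumptions rather than the implementation analyzed in this chapter: the partial search is a plain negamax alpha-beta on nested lists, a lock stands in for the curly-bracket critical sections, the interval (-∞, +∞) is replaced by finite bounds `low` and `high` that must strictly contain the solution, and `select_new_interval` is a hypothetical chunking policy chosen only because it satisfies specifications (a) and (b').

```python
import threading


def alphabeta(node, alpha, beta):
    """Negamax alpha-beta; a leaf is an int scored for the side to move."""
    if isinstance(node, int):
        return node
    for child in node:
        t = -alphabeta(child, -beta, -alpha)
        if t > alpha:
            alpha = t
        if alpha >= beta:
            break
    return alpha


def parallel_search(root, k, low, high):
    """Run k threads as in Program A; requires low < V0 < high."""
    state = {"calpha": low, "cbeta": high, "solution": None}
    lock = threading.Lock()                      # critical sections

    def select_new_interval(j):
        # Hypothetical policy: thread j takes the j-th chunk of the integers
        # strictly between CALPHA and CBETA, widened by one on each side.
        ca, cb = state["calpha"], state["cbeta"]
        if ca == cb:
            return ca, cb                        # specification (a)
        count = cb - ca - 1
        start = ca + 1 + (j * count) // k
        end = ca + ((j + 1) * count) // k
        if end < start:                          # empty chunk: take everything
            start, end = ca + 1, cb - 1
        return start - 1, end + 1                # specification (b'): A <= B-2

    def process(j):
        with lock:
            a, b = select_new_interval(j)
        while a < b:
            v = alphabeta(root, a, b)            # partial search, outside the lock
            with lock:
                if v <= a:
                    state["cbeta"] = min(state["cbeta"], a + 1)       # (6.4)
                elif v >= b:
                    state["calpha"] = max(state["calpha"], b - 1)     # (6.5)
                else:
                    state["calpha"] = state["cbeta"] = v              # (6.6)
                    state["solution"] = v
                a, b = select_new_interval(j)

    threads = [threading.Thread(target=process, args=(j,)) for j in range(k)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["solution"]


if __name__ == "__main__":
    print(parallel_search([[3, 5], [2, 9]], k=3, low=-100, high=100))
```

Whatever the interleaving of the threads, the value returned equals the result of a single full-window search, which is the content of Theorem 6.1 together with specification (b').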

6.2 - Some improvements on Program A

A feature of the parallel implementation presented in Section 6.1 is that
intercommunication between processes is reduced to a minimum, and confined to the
selection of a new interval over which a partial search is to take place next. As a
consequence, once a process has initiated a partial search, it runs until completion
oblivious of the results of the other processes. This can obviously be overly wasteful
since the interval searched by a process might be ruled out by some other process very
soon after the beginning of the search.
This shortcoming can be eliminated in several ways. First, a process completing a
partial search could check all other processes, causing them, if necessary, either to abort
their searches or to readjust their intervals. This solution, however, requires a lot of
book-keeping and becomes impractical when a large number of processes are cooperating.

Another solution is to have each process modify its own interval by regularly
checking possible changes of the variables CALPHA and CBETA during the search. Let
A' ≤ A < B ≤ B', and consider the two calls:

ALPHABETA(Root, A', B') and ALPHABETA(Root, A, B) .

It is easy to check, by induction, that if node P is explored by the second call, through
ALPHABETA(P, α, β), node P is also explored by the first call, through ALPHABETA(P, α', β').
Moreover, the bounds α, β, α' and β' satisfy:

$$\alpha = \max\{\alpha', A\} , \quad \beta = \min\{\beta', B\} , \quad \text{if P is at even depth,} \qquad (6.9)$$

$$\alpha = \max\{\alpha', -B\} , \quad \beta = \min\{\beta', -A\} , \quad \text{if P is at odd depth.} \qquad (6.10)$$

The procedure AB, below, is a modification of the procedure ALPHABETA, in which the
bounds alpha and beta are regularly updated according to the relations (6.9) and (6.10) to
take into account the changes of the two variables CALPHA and CBETA.


integer procedure AB(position P, integer alpha, integer beta, boolean even):
begin
    determine the successor positions: P_1, ..., P_n;
    if n = 0 then
        AB := f(P)
    else
    begin
        for j := 1 step 1 until n do
        begin
            t := -AB(P_j, -beta, -alpha, not even);
            if t > alpha then alpha := t;
            if even then
                {alpha := max{alpha, CALPHA}; beta := min{beta, CBETA}}
            else
                {alpha := max{alpha, -CBETA}; beta := min{beta, -CALPHA}};
            if alpha ≥ beta then goto done
        end;
done:   AB := alpha
    end
end


A modified Alpha-Beta procedure
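The modified procedure can be sketched in Python as follows. This is an illustrative transcription under assumptions: positions are nested lists with integer leaves scored for the side to move, and the shared variables are held in a small `Bounds` object (a hypothetical name) that is read without locking, which is adequate for this single-threaded demonstration only.

```python
class Bounds:
    """Holder for the shared variables CALPHA and CBETA (hypothetical helper)."""

    def __init__(self, calpha, cbeta):
        self.calpha = calpha
        self.cbeta = cbeta


def ab(node, alpha, beta, even, bounds):
    """Negamax alpha-beta that re-reads the shared bounds after each successor."""
    if isinstance(node, int):
        return node
    for child in node:
        t = -ab(child, -beta, -alpha, not even, bounds)
        if t > alpha:
            alpha = t
        if even:    # relation (6.9): same sign convention as the root
            alpha = max(alpha, bounds.calpha)
            beta = min(beta, bounds.cbeta)
        else:       # relation (6.10): signs flipped at odd depth
            alpha = max(alpha, -bounds.cbeta)
            beta = min(beta, -bounds.calpha)
        if alpha >= beta:
            break
    return alpha


if __name__ == "__main__":
    tree = [[3, 5], [2, 9]]                              # minimax value 3
    print(ab(tree, -100, 100, True, Bounds(-100, 100)))  # shared bounds never tighter
    print(ab(tree, -100, 100, True, Bounds(2, 4)))       # shared bounds already narrowed
```

Both calls return the minimax value 3; in the second, the window actually used inside the search is the narrower one supplied by the shared bounds.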
Relations similar to the relations (6.1), (6.2) and (6.3) hold for the procedure AB as
well. Consider the call:

V := AB(P, α, β, true) , (6.11)

and as before define $V_0$ = MINIMAX(P). Also, let A and B denote the values of the two
variables CALPHA and CBETA when returning from the call (6.11) (i.e., as of the last time
they are used during the execution of the call). For A' and B' satisfying A' ≥ A and B' ≤ B,
define α' = max{α, A'} and β' = min{β, B'}. We have the following.

Theorem 6.2:

With the above notations, provided that:

$$A' \le V_0 \le B' \quad \text{and} \quad \alpha' < \beta' ,$$

we have:

$$\text{if } V \le \alpha' \text{ then } V_0 \le \alpha' ,$$

$$\text{if } \alpha' < V < \beta' \text{ then } V_0 = V ,$$

$$\text{if } V \ge \beta' \text{ then } V_0 \ge \beta' .$$

Proof:

The proof follows easily (by induction on the depth of node P) from the
relations (6.1), (6.2) and (6.3) and the relations (6.9) and (6.10). ■

Program B, below, directly implements the relations stated in this theorem. Since
the analog of Theorem 6.1 can be proved for Program B as well, its correctness is a direct
consequence of Theorem 6.2.

Program B:

global integer CALPHA, CBETA;

Initialization:
begin
    CALPHA := -∞; CBETA := +∞;
    start processes P_1, ..., P_k
end

Process P_j:
begin
    integer A_j, B_j, V_j;
    {(A_j, B_j) := SELECTNEWINTERVAL};
    while A_j < B_j do
    begin
        V_j := AB(Root, A_j, B_j, true);
        {A_j := max{A_j, CALPHA}; B_j := min{B_j, CBETA}};
        if A_j < B_j then
        begin
            if V_j ≤ A_j then
            begin
                {CBETA := min{CBETA, A_j+1};
                 (A_j, B_j) := SELECTNEWINTERVAL}
            end
            else
            if V_j ≥ B_j then
            begin
                {CALPHA := max{CALPHA, B_j-1};
                 (A_j, B_j) := SELECTNEWINTERVAL}
            end
            else
            begin
                {CALPHA := CBETA := V_j};
                return the solution: V_j;
                terminate
            end
        end
        else
            {(A_j, B_j) := SELECTNEWINTERVAL}
    end;
    terminate
end


Procedures ALPHABETA and AB implement two extreme alternatives in which the
bounds alpha and beta are never updated and in which they are updated each time they
are used. A more efficient implementation would be to update alpha and beta only when
changes have been made on the variables CALPHA and CBETA. This can be achieved very
easily by introducing a global counter incremented by 1 inside the critical section after
each of the instructions of Program B modifying CALPHA and/or CBETA, and by introducing
a counter local to each process to check if the latest modifications of CALPHA and CBETA
have been taken into account. Since the counters can only increase, no additional critical
section is required. We will not present the implementation details, but the point, here, is
mainly to show that it is possible to implement (at a very low extra cost) each process so
that it is continuing a partial search only if the result of the search can produce the
solution or, at least, a reduction of the interval in which the solution can lie. In
particular, we note that, in Program B, process $P_j$ will terminate its search as soon as, for
example, CALPHA ≥ $B_j$ or CBETA ≤ $A_j$, either condition ruling out the original interval
($A_j$, $B_j$). This property will be taken into account in the analysis presented in Section 7.
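The counter technique described above can be sketched as follows. The text deliberately omits the implementation details, so everything here is an assumption made for illustration: the class names `SharedBounds` and `LocalView`, the `tighten` and `snapshot` methods, and in particular the choice to take a lock when copying the pair of bounds (the text argues that the counter comparison itself needs no critical section, which is the unlocked test in `current`).

```python
import threading


class SharedBounds:
    """CALPHA, CBETA and a global modification counter (hypothetical helper)."""

    def __init__(self, calpha, cbeta):
        self.calpha, self.cbeta, self.version = calpha, cbeta, 0
        self._lock = threading.Lock()

    def tighten(self, calpha, cbeta):
        """Narrow the interval; bump the counter only if something changed."""
        with self._lock:
            new_a = max(self.calpha, calpha)
            new_b = min(self.cbeta, cbeta)
            if (new_a, new_b) != (self.calpha, self.cbeta):
                self.calpha, self.cbeta = new_a, new_b
                self.version += 1

    def snapshot(self):
        with self._lock:
            return self.calpha, self.cbeta, self.version


class LocalView:
    """Per-process cache of the bounds, refreshed only when the counter moved."""

    def __init__(self, shared):
        self.shared = shared
        self.calpha, self.cbeta, self.seen = shared.snapshot()
        self.refreshes = 0

    def current(self):
        if self.shared.version != self.seen:    # cheap test on a monotone counter
            self.calpha, self.cbeta, self.seen = self.shared.snapshot()
            self.refreshes += 1
        return self.calpha, self.cbeta


if __name__ == "__main__":
    shared = SharedBounds(-100, 100)
    view = LocalView(shared)
    print(view.current(), view.refreshes)
    shared.tighten(-5, 100)                     # another process raises CALPHA
    print(view.current(), view.refreshes)
```

A search procedure would call `current()` wherever AB reads CALPHA and CBETA, paying for a refresh only after another process has actually changed them.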

7 - Analysis of the parallel α-β pruning algorithm

We will proceed in this section to the analysis of the parallel algorithm described in
the preceding section. Since the algorithm is organized around parallel executions of
partial searches, it is the first thing we want to analyze. Most of this analysis differs very
slightly from the analysis developed in Sections 2 and 3, and we will only present in
Sections 7.1 and 7.2 the main results leading to the evaluation of a partial search. The
overall evaluation of the algorithm depends upon the procedure SELECTNEWINTERVAL and
will be derived in Section 7.3.

7.1 - Condition for a node to be examined under a partial search

As in Section 2, let $J = j_1 ... j_d$ denote a node at depth d in a game tree and, for
0 ≤ i ≤ d-1, let $J_i = j_1 ... j_{d-i}$. The notations for v(J) and c(J) remaining the same, we
now define:

$$\alpha'(J) = \max\{\, c(J_{d-i}) \mid i \text{ is odd},\ 1 \le i \le d \,\} ,$$

$$\beta'(J) = \max\{\, c(J_{d-i}) \mid i \text{ is even},\ 1 \le i \le d \,\} .$$

Given the two bounds α and β, we also define:

$$A'(J) = \max\{\, \alpha,\ \alpha'(J) \,\} ,$$

$$B'(J) = \max\{\, -\beta,\ \beta'(J) \,\} .$$

The analog of Theorem 2.1 for a partial search can now be stated in the following.

Theorem 7.1:

Assume that the root of a game tree is explored through the call
ALPHABETA(Root, α, β)
by some process executing the parallel procedure of Section 6.1. Then, with the
above notations and provided that α < β, an arbitrary node J of the game tree will be
subsequently explored if and only if:

$$A'(J) + B'(J) < 0 . \qquad (7.1)$$

Proof:

The proof is immediate by induction. ■

Observe that, when the procedure AB of Section 6.2 is used instead of the
procedure ALPHABETA, condition (7.1) only remains a necessary condition for node J to be
explored through the call AB(Root, α, β). It is no longer a sufficient condition since, by
updating the bounds α and β during the execution of the procedure AB, additional pruning
might occur.

In the following evaluation of a partial search we will assume that the process
executes the procedure ALPHABETA, and we will utilize condition (7.1) to characterize the
fact that node J is explored.

7.2 - Average number of nodes explored under a partial search

As before, we will consider a rug tree of degree n and depth d, and we will assume
first that the bottom values are independent identically distributed random variables
distributed according to some discrete probability distribution $\{p_k\}$, -m ≤ k ≤ m, where $p_k$
is the probability that a bottom value be assigned the value $x_k = k/m$, for -m ≤ k ≤ m.

Given two bounds α and β, we define $k_1$ and $k_2$ by:

$$\alpha = x_{k_1} , \quad \beta = x_{k_2} .$$

Since the values α and β could be unbounded, it is convenient to define $x_{-m-1} = -\infty$ and
$x_{m+1} = +\infty$. Throughout we will only consider the partial search corresponding to the call
ALPHABETA(Root, α, β), and we will assume that α < β, which can equivalently be expressed
as $-m-1 \le k_1 < k_2 \le m+1$.


Using arguments identical to those of Section 3.1, the probability distributions for
the quantities A'(J) and B'(J) can be obtained immediately as a function of the quantities
$p_i(k)$ for 0 ≤ i ≤ d and -m-1 ≤ k ≤ m. Then the probability π(J) that some node J of the
game tree be explored under a partial search can be derived from these results using the
characterization given by condition (7.1). As with Theorem 3.1, the following theorem
results directly from the expression for π(J). In order to present a uniform result
(independent of the parity of d) in this theorem, we depart slightly from the notations of
Section 3.1, and the two products Π appearing below are now extended over all even and
odd integers i, respectively, in the range 1 ≤ i ≤ d.

Theorem 7.2:

The average number of bottom positions examined under a partial
search is given by:

- TT^ 

* kj*jLk2-l  * TT,  o-^.^(k)  . (7.2) 

Proof: 

As with the proof of Theorem 3.1, the result follows directly by summing the
probabilities π(J) over all terminal positions. ■

When assuming that all bottom values are distributed according to some continuous
probability  distribution  (or,  similarly,  are  all  distinct),  again  we  can  obtain,  as  in 
Section  the  average  number  of  bottom  positions  examined  under  a partial  search  by 
considering  the  limit  of  in  equation  (7.2).  At  this  point  it  is  convenient  to 

consider the cumulative distribution for the value v(Root) with respect to the two points α
and β. Namely, given the probability distribution $\{p_k\}_{-m \le k \le m}$, or equivalently
$\{p_0(k)\}_{-m \le k \le m}$, and given $\alpha = x_{k_1}$ and $\beta = x_{k_2}$, we introduce:
®/h  * t - Pd(-ki-l) , 


If,  in  general,  we  let: 

t ■ p^-m)  * ...  ♦ Pd(k)  ml-  pj(-k-l) , 
and define in an obvious way the functions P and Q on [0, 1] by the correspondence:

p(t)  - 

0(t)  - 

we can state the limit of equation (7.2) in the following theorem.


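Theorem 7.3:

Provided that:

$$\lim_{m \to \infty} \max\{\, p_0(k) \mid -m \le k \le m \,\} = 0$$

and that:

$$\lim_{m \to \infty} a_m = a , \quad \lim_{m \to \infty} b_m = b ,$$

the limit of the quantity given by equation (7.2), when m → ∞, is given by:

$$N_{n,d}(a,b) = P(a)\,Q(a) + \int_a^b P'(t)\,Q(t)\,dt . \qquad (7.3)$$
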

Both  Theorem  7.2  and  Theorem  7.3  provide  us  with  a cost  of  executing  a partial 
search,  measured  by  the  number  of  terminal  positions  examined  during  the  search,  when 
the  bottom  values  are  distributed  according  to  either  a discrete  or  a continuous 
probability  distribution. 


In Figure 7.1, we have plotted, for x ∈ [0, 1], the two quantities

$$C(x) = P(x)\,Q(x) ,$$

$$H(x) = \int_0^x P'(t)\,Q(t)\,dt .$$

We deduce from equation (7.3) that $N_{n,d}(a,b)$ can be expressed directly from these two
quantities as:

$$N_{n,d}(a,b) = C(a) + H(b) - H(a) ,$$

with an immediate interpretation in Figure 7.1. If we consider the case when the bottom
values are distributed according to a discrete probability distribution, then the average
number of bottom positions examined, as given by equation (7.2), can be expressed similarly
as a function of $a_m$ and $b_m$. The functions C and H are, in this case, simply replaced by
step functions, which coincide with the continuous functions C and H at the points
$t = 1 - p_d(-k-1)$, for -m ≤ k ≤ m.
Figure 7.1 - An interpretation for $N_{n,d}(a,b)$ [plot of C(x) and H(x) against x]

7.3 - The analysis of the parallel α-β pruning algorithm

The  results  of  Section  7.2  show  that  the  cost  of  executing  the  partial  search 
corresponding  to  the  call 

    ALPHABETA(Root, α, β)

can be expressed by:

    $$ c(a,b) = C(a) + [H(b) - H(a)] , $$

with

    $$ a = \mathrm{Proba}\{ V \le \alpha \} , \qquad b = \mathrm{Proba}\{ V \le \beta \} , $$

where V is the random variable representing the value backed-up to the root of the game
tree (by the MINIMAX procedure). Given the probability distribution for the random
variable V, we have a one-to-one correspondence between intervals $(\alpha, \beta)$ of
$(-\infty, +\infty)$ and intervals (a, b) of (0, 1). Using this correspondence, we will only
talk in the following about partial searches over intervals of (0, 1).

Although the two functions C and H are readily computed numerically, they do not
lend themselves very easily to analysis and, in the remainder of the section, we will
consider an approximation suggested by Figure 7.1. We notice in the example depicted in
this figure that C(x) remains nearly constant when x varies in the interval [0, 1] and that
H(x) varies almost linearly on the same interval. While the numerical results presented in
Figure 7.1 correspond to a partial search of a rug tree of degree n = 3 and depth d = 6,
numerical results obtained with other values of n and d actually show that the
approximation of C by a constant and of H by a linear function is even better for large
values of n and d. This is especially true in an open interval contained in [0, 1]. In
consequence, we will assume in the following that the cost of executing a partial search
over any interval (a, b) of [0, 1] is exactly given by:

    $$ c(a,b) = p + q\,[b - a] , \tag{7.4} $$

where p and q only depend on the rug tree itself (i.e., on n and d). Numerical experiments,
not presented here, have been run for n = 3, 4, 8, 16 and 32 and for $2 \le d \le 8$. It turns
out that, although p and q are obviously very dependent on n and d, the ratio
$\lambda = p/q$ does not show a large variation and lies typically in the range
$0.2 \le \lambda \le 0.4$.

Without loss of generality, we will normalize the cost c(a,b) of equation (7.4) by
assuming that q = 1 (hence $p = \lambda$) and we will consider throughout that:

    $$ c(a,b) = \lambda + b - a , $$

or, equivalently, with b = a + h, that:

    $$ c(a,a+h) = \lambda + h . \tag{7.5} $$

This cost will also be taken, in the following section, as the time for a process to execute
a partial search over the interval (a, b) = (a, a+h).


7.3.1 - An analysis of the parallel implementation: Optimal decomposition


Given the cost of a partial search through equation (7.5), we will determine in this
section the optimal decomposition of the interval [0, 1]. With this result, the optimal
procedure SELECTNEWINTERVAL, introduced in Section 6.1, can be defined for
$k \ge 2$ processes.

As an example, we first examine the special case when the interval [0, 1] is split
into k subintervals $I_1, \ldots, I_k$ searched in parallel by processes $P_1, \ldots, P_k$,
respectively. Let $s_i$ be the size of $I_i$, for i = 1, ..., k, with
$s_1 + \cdots + s_k = 1$. Under this decomposition, process $P_i$ will find the solution, with
probability $s_i$, after a cost $\lambda + s_i$. Therefore, the average cost (or time) to find
the solution is, in this case, simply given by:

    $$ T = s_1 (\lambda + s_1) + \cdots + s_k (\lambda + s_k) $$
    $$ \phantom{T} = \lambda + s_1^2 + \cdots + s_k^2 , $$

for which the minimum, $T_0$, is achieved when $s_i = 1/k$ for i = 1, ..., k (recall that
$s_1 + \cdots + s_k = 1$), which yields:

    $$ T_0 = \lambda + 1/k . $$

The decomposition of the interval [0, 1] presented in this example is the simplest one, and
it does not allow any feedback between the processes since the k partial searches cover
the whole interval [0, 1]. The example confirms, however, the obvious fact that, in order
to achieve the minimum cost, the k subintervals searched by the k processes should be of
equal length.
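
As a quick numerical check of this example, the following Python sketch evaluates the
average cost for an equal and an unequal split between two processes. The function name
`average_cost` and the sample value of lambda (1/3, chosen inside the typical range quoted
earlier) are illustrative assumptions, not part of the original text.

```python
def average_cost(sizes, lam):
    """Average cost when k subintervals of [0, 1] with the given sizes
    are searched in parallel under the cost model c(a, a+h) = lam + h."""
    if abs(sum(sizes) - 1.0) > 1e-12:
        raise ValueError("subinterval sizes must sum to 1")
    # Process i finds the solution with probability s_i after a cost lam + s_i.
    return sum(s * (lam + s) for s in sizes)


lam = 1.0 / 3.0                      # hypothetical value of lambda
equal = average_cost([0.5, 0.5], lam)
skewed = average_cost([0.7, 0.3], lam)
print(equal, skewed)                 # the equal split is the cheaper one
```

With k = 2 the equal split gives lambda + 1/2, and any unequal split costs more.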

In order to introduce some feedback between the processes, we now consider a
further decomposition of the interval [0, 1], illustrated in the diagram of Figure 7.2 in the
case of two processes.

    |-----|=====|-----|=====|-----|
    0     a     b     c     d     1

Figure 7.2 - A decomposition of [0,1]

The two processes $P_1$ and $P_2$ start exploring in parallel the two subintervals [a, b] and
[c, d], respectively. If either process finds the solution at the completion of this first
search, with probability (b-a) or (d-c), the execution terminates with a cost of either
$(\lambda + b - a)$ or $(\lambda + d - c)$. Otherwise, consider that process $P_1$ finishes
first. If it finds out that the solution lies in the interval [0, a], we know that, with the
implementation proposed in Section 6.2, process $P_2$ will terminate its search immediately
after and, therefore, both processes can start simultaneously new partial searches within
the interval [0, a]. If, on the other hand, process $P_1$ finds out that the solution lies in
the interval [b, 1], it will start arbitrarily a partial search over an interval within
[b, c] or [d, 1] while waiting for process $P_2$ to complete its initial partial search and,
possibly, will readjust its search as soon as process $P_2$ finishes. If we assume that both
intervals [a, b] and [c, d] are of equal length, both processes will finish their initial
searches roughly at the same time. We will neglect in the following the delay involved in
making the decision as to which subinterval actually contains the solution, and we will
assume that, if the solution has not yet been found, the processes restart a new partial
search simultaneously.

According to this decomposition, k subintervals are initially searched by the k
processes and, if the solution is not found during this first trial, it is known to lie in 1 of
k+1 subintervals depending upon the outcomes of the first partial searches. Thus k
subintervals will be searched during the second trial out of a total of k(k+1) possible
subintervals. In general, if not successful after the i-th trial, the k processes will start
simultaneously k new partial searches over $a_i = k(k+1)^i$ possible subintervals during the
(i+1)-st trial.

Let $h_0 = 1$, and, for i = 1, 2, ..., let $h_i$ be the total length of the interval [0, 1] that
still could be explored after the i-th trial. Then, for i = 1, 2, ..., $h_{i-1} - h_i$ measures
the total length of all $a_{i-1}$ subintervals that could be searched during the i-th trial. It
also measures the probability that the solution be found at that time after a cost given by:

    $$ c_i = (\lambda + s_1) + \cdots + (\lambda + s_i) , $$

assuming that the $a_{i-1}$ subintervals that could be searched during the i-th trial have all
the same length:

    $$ s_i = (h_{i-1} - h_i) / a_{i-1} . $$

The total average cost, T, follows immediately. We have:

    $$ T = \sum_{i \ge 1} (h_{i-1} - h_i)\, c_i , $$

    $$ T = \sum_{i \ge 1} \Big[\, i \lambda\, (h_{i-1} - h_i) + (h_{i-1} - h_i) \sum_{j=1}^{i} (h_{j-1} - h_j)/a_{j-1} \Big] , $$

    $$ T = \lambda \sum_{i \ge 0} h_i + \sum_{i \ge 0} \frac{1}{a_i}\, h_i\, (h_i - h_{i+1}) . \tag{7.6} $$
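
The average cost of a given decomposition can be evaluated directly. The sketch below
assumes the reading of expression (7.6) as lambda times the sum of the h_i plus the sum
of h_i (h_i - h_{i+1}) / a_i, with a_i = k (k+1)^i; the function name `expected_cost` and the
convention that every h_i beyond the end of the list is zero are assumptions made for
illustration.

```python
def expected_cost(h, k, lam):
    """Average cost of a decomposition h = [h_0, h_1, ...] with h_0 = 1.

    Entries past the end of the list are taken to be 0, and
    a_i = k * (k + 1)**i counts the candidate subintervals at trial i + 1.
    """
    total = 0.0
    for i, h_i in enumerate(h):
        h_next = h[i + 1] if i + 1 < len(h) else 0.0
        a_i = k * (k + 1) ** i
        total += lam * h_i + h_i * (h_i - h_next) / a_i
    return total


lam = 1.0 / 3.0                                  # hypothetical value of lambda
one_trial = expected_cost([1.0], 2, lam)         # whole interval covered at once
two_trials = expected_cost([1.0, 0.5], 2, lam)   # half the interval left for trial 2
print(one_trial, two_trials)
```

For two processes and lambda = 1/3, leaving half of the interval for a second trial
already lowers the average cost below that of the single-trial decomposition.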

The following theorem states the optimal decomposition $\{h_i\}_{i \ge 0}$ leading to the
minimum average cost of expression (7.6). For $k \ge 2$, we will consider the following
sequence of intervals (recall that $a_j = k(k+1)^j$):

    $$ A_j = [\, 1/a_j ,\ (k-1)/a_j \,) , \quad \text{for } j = 0, 1, 2, \ldots , $$

and

    $$ B_j = [\, (k-1)/a_j ,\ 1/a_{j-1} \,) , \quad \text{for } j = 1, 2, \ldots . $$

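Theorem 7.4:

Assume $k \ge 2$, and let $C_k(\lambda)$ denote the minimum of expression (7.6) over all
possible decompositions $\{h_i\}_{i \ge 0}$.

(a) If $\lambda \in A_j$, for some j = 0, 1, ..., the minimum of expression (7.6) is achieved for:

    $$ h_0 = \cdots = h_j = 1 \quad \text{and} \quad h_{j+1} = h_{j+2} = \cdots = 0 , $$

yielding:

    $$ C_k(\lambda) = (j+1)\,\lambda + 1/a_j . $$

(b) Otherwise, if $\lambda \in B_j$, for some j = 1, 2, ..., the minimum is achieved for:

    $$ h_0 = \cdots = h_{j-1} = 1 , \quad h_j = \frac{a_j}{2} \Big( \frac{1}{a_{j-1}} - \lambda \Big) , \quad h_{j+1} = h_{j+2} = \cdots = 0 , $$

yielding:

    $$ C_k(\lambda) = j\lambda + \frac{1}{a_{j-1}} - \frac{a_j}{4} \Big( \frac{1}{a_{j-1}} - \lambda \Big)^2 . $$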
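Observe first that the decomposition $\{h_i\}_{i \ge 0}$ satisfies:

    $$ 1 = h_0 \ge h_1 \ge \cdots \ge h_i \ge \cdots \ge 0 . $$

Assume that $\lambda \ge 1/a_j$, for some $j \ge 0$. Given any decomposition $\{h_i\}$, consider
another decomposition $\{h'_i\}$ defined by:

    $$ h'_i = \begin{cases} h_i & \text{if } i \le j , \\ 0 & \text{if } i \ge j+1 , \end{cases} $$

and let T' denote the expression (7.6) where $h_i$ is replaced by $h'_i$. We have:

    $$ T - T' = \lambda \sum_{i \ge j+1} h_i + \sum_{i \ge j+1} \frac{1}{a_i}\, h_i (h_i - h_{i+1}) - \frac{1}{a_j}\, h_j h_{j+1} $$
    $$ \phantom{T - T'} \ge \lambda\, h_{j+1} + 0 - \frac{1}{a_j}\, h_j h_{j+1} $$
    $$ \phantom{T - T'} \ge \Big( \lambda - \frac{1}{a_j} \Big) h_{j+1} \ge 0 , $$

which shows that T is minimized when $h_i = 0$ for $i \ge j+1$.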

Assume now that $\lambda < (k-1)/a_j$ for some $j \ge 1$. Assume furthermore that
$h_{i-1} = 1$ for some i, $1 \le i \le j$ (recall that $h_0 = 1$). We have:

    $$ \lambda < (k-1)/a_j \le (k-1)/a_i , $$

which shows that the derivative of T with respect to $h_i$ satisfies:

    $$ \frac{\partial T}{\partial h_i} = \lambda - \frac{1}{a_{i-1}}\, h_{i-1} + \frac{1}{a_i}\, (2 h_i - h_{i+1}) $$
    $$ \phantom{\frac{\partial T}{\partial h_i}} < - \frac{2}{a_i}\, (1 - h_i) - \frac{1}{a_i}\, h_{i+1} \le 0 . $$

This last inequality shows that T decreases when $h_i$ increases from 0 to 1 and that,
therefore, the minimum of T is achieved when $h_i = 1$. Since $h_0 = 1$, we have shown
part (a) of the theorem.

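Assume now that $\lambda \in B_j$ for some $j \ge 1$, i.e.:

    $$ (k-1)/a_j \le \lambda < 1/a_{j-1} . $$

In particular, since $k \ge 2$, $\lambda \ge 1/a_j$ and $\lambda < (k-1)/a_{j-1}$. It follows from
the above proof that $h_0 = \cdots = h_{j-1} = 1$ and that $h_{j+1} = h_{j+2} = \cdots = 0$.
Hence, expression (7.6) becomes:

    $$ T = j\lambda + \frac{1}{a_{j-1}} - \Big( \frac{1}{a_{j-1}} - \lambda \Big) h_j + \frac{1}{a_j}\, h_j^2 . $$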
The various curves of Figure 7.3 represent the speed-up $S_k(\lambda) = C(\lambda)/C_k(\lambda)$
achieved by the parallel implementation with k processes over the original algorithm for
k = 2, 3, 4 and for the limiting case $k = \infty$. In this latter case $1/a_0$ simply reduces
to 0 and we always have $\lambda \in A_0$. It follows from Theorem 7.4 that
$C_\infty(\lambda) = \lambda$ and therefore

    $$ S_\infty(\lambda) = \frac{\lambda + 1}{\lambda} = 1 + \frac{1}{\lambda} . $$
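
The speed-up curves can be tabulated with a few lines of Python. This is a sketch under
stated assumptions: it takes the closed forms of Theorem 7.4 as (j+1) lambda + 1/a_j on A_j
and j lambda + 1/a_{j-1} - (a_j/4)(1/a_{j-1} - lambda)^2 on B_j, and it takes the cost of the
original algorithm to be 1 + lambda (equation (7.5) applied to the whole interval). The names
`optimal_cost` and `speed_up` are hypothetical.

```python
def optimal_cost(k, lam):
    """Minimum average cost C_k(lam) for k >= 2 and 0 < lam < (k - 1) / k."""
    if k < 2 or not (0.0 < lam < (k - 1) / k):
        raise ValueError("requires k >= 2 and 0 < lam < (k - 1) / k")
    j = 0
    while True:
        a_j = k * (k + 1) ** j
        if lam >= 1.0 / a_j:
            if lam < (k - 1) / a_j:
                # lam lies in A_j: search everything during the first j + 1 trials.
                return (j + 1) * lam + 1.0 / a_j
            # lam lies in B_j (this branch is only reached with j >= 1).
            a_prev = k * (k + 1) ** (j - 1)
            return j * lam + 1.0 / a_prev - (a_j / 4.0) * (1.0 / a_prev - lam) ** 2
        j += 1


def speed_up(k, lam):
    """Ratio of the sequential cost 1 + lam to the optimal parallel cost."""
    return (1.0 + lam) / optimal_cost(k, lam)


for lam in (0.2, 0.3, 0.4):
    print(lam, [round(speed_up(k, lam), 3) for k in (2, 3, 4)])
```

The loop visits the intervals in order of decreasing lambda (A_0, B_1, A_1, B_2, ...), so it
stops at the first j for which lambda is at least 1/a_j.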

7.3.2  - Implications  of  the  results  and  validity  of  the  assumptions 

Let us examine the results of the preceding section as illustrated in Figure 7.3. We
noticed earlier that the initial cost of a partial search, $\lambda$, typically lies in the range
(0.2, 0.4). We observe from Figure 7.3 that when k = 2, for example, the parallel
implementation can improve upon the original (sequential) α-β pruning algorithm by a
factor which can be larger than 2 when $\lambda$ lies in the range of practical interest.
Moreover, when $\lambda$ becomes small, the improvement actually becomes unbounded, as can
be seen by choosing $\lambda = 1/a_j$ for which we have: $S_k(\lambda) = (a_j + 1)/(j + 2)$. An
immediate consequence of the results of Section 7.3.1, therefore, is that the α-β pruning
algorithm (as described in Section 2) is not optimal. The same strategy used for the
parallel implementation with two or more processes is obviously also suitable to the case
of only one process, and, in a similar fashion, we can deduce an optimal decomposition of
the interval [0, 1] for this case as well. Although the results of Theorem 7.4 are not
applicable for the sequential case (only the first part of the proof is relevant when
k = 1), simple calculus shows that when $\lambda \in (0.2, 0.4)$ an improvement between 15%
and 25% can be achieved over the original algorithm, and this constitutes a substantial
gain.

The analysis developed in Section 7.3.1 relies implicitly on the knowledge of the
distribution for the value $V_0$ backed-up to the root of the game tree. In particular, when
we state, in Theorem 7.4, the optimal decomposition of the interval [0, 1] in terms of
the $h_i$, we really need to know the distribution of $V_0$ to actually implement the
procedure SELECTNEWINTERVAL according to this optimal decomposition. When nothing is known
The cost associated with the actual decomposition is easily evaluated and is given
by:

    $$ T = p\,(\lambda + p) + x\,(\lambda + p + \lambda + x) + (1 - x - p)\,(\lambda + p + \lambda + 1 - x - p) $$
    $$ \phantom{T} = (2 - p)\,\lambda + p + x^2 + (1 - p - x)^2 , $$

from which we deduce that the worst case, achieved for x = 0 or x = 1 - p, is given by:

    $$ T_2 = (2\lambda + 1) - (\lambda + 1)\,p + p^2 , $$

corresponding to $T_2 = 1.24$ when $\lambda = 1/3$ and p = 0.8. Although this worst case still
corresponds to an increase of 11.6% over the optimal cost, it is an improvement of 7% over
the cost of the original algorithm. Yet, in view of the optimal case, one could think of
improving the cost by reducing the first interval so as to have p = 1/3, but then this would
increase the worst case, which would, in fact, correspond in this case to the cost of the
original algorithm, therefore showing no improvement. (Looking at the best case,
however, we could achieve the optimal case in this way, but only with the risk of
aggravating the worst case.)
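
The figures quoted above can be reproduced numerically. The sketch below assumes the cost
expression reads T = p (lambda + p) + x (2 lambda + p + x) + (1 - x - p)(2 lambda + 1 - x),
where p is the length of the first interval searched and x, 1 - p - x are the lengths of
the two remaining pieces; the function names and the comparison cost 1 + lambda for the
original algorithm are assumptions made for illustration.

```python
def sequential_cost(p, x, lam):
    """Average cost of the one-process decomposition with pieces p, x, 1 - p - x."""
    rest = 1.0 - p - x
    return (p * (lam + p)
            + x * (2 * lam + p + x)
            + rest * (2 * lam + p + rest))


def worst_case(p, lam):
    """Largest value of sequential_cost over x, reached at x = 0 or x = 1 - p."""
    return (2 * lam + 1) - (lam + 1) * p + p * p


lam = 1.0 / 3.0
t2 = worst_case(0.8, lam)
original = 1.0 + lam                 # one search over the whole interval
print(t2, 1.0 - t2 / original)       # worst case and its gain over the original
print(worst_case(1.0 / 3.0, lam))    # worst case when p is reduced to 1/3
```

Reducing p to 1/3 brings the worst case back up to 1 + lambda, which is the "no
improvement" situation described in the text.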

The results we have developed rely on several simplifying assumptions, and we
would like to conclude this section by examining their validity. While equations (7.2)
and (7.3) provide us with the exact cost of a partial search over some interval
$(\alpha, \beta)$ (or (a, b) equivalently), measured by the number of terminal positions
examined during the search, we have used the approximation given by equation (7.5) to
derive the results of Section 7.3.1. As we have mentioned, however, this approximation
seems to be reasonable and more and more accurate as the game tree becomes larger, and we
do not feel that this approximation leads to a large error in the analysis. In order to
check on the validity of this approximation, however, we have run a series of simulations
and compared the results with the results predicted by Theorem 7.4, where $\lambda$ was
computed numerically by using a least square approximation to the functions C(x) and H(x)
on the interval [0, 1] (see Figure 7.1). The simulation results were very consistent with
the analytical results and showed an actual improvement over the original algorithm
between 5% and 10% better than the improvement predicted by the theory.
The simulation was also aimed at verifying another simplifying assumption we have
used in the analysis. While equation (7.5) provides us with the unconditional average cost
of a partial search over an interval (a, b), what we really need to derive equation (7.6) is
the cost of a partial search over an interval (a, b) conditioned by the fact that the
solution lies in some interval (a', b') (possibly the same interval). Here, too, the
simulation results were useful to validate this simplifying assumption.

8 - Conclusions  and  open  problems 

We have presented in the first part of the chapter an analysis of the performance of
the α-β pruning algorithm for searching a uniform tree of degree n and depth d when the
values assigned to the terminal nodes are independent identically distributed random
variables. The analysis takes into account both shallow and deep cut-offs, and we have
also considered the effect of equalities between the values assigned to the terminal nodes.

A simple formula was derived, in Section 3, to measure the number of terminal
nodes examined by the α-β procedure when the bottom values are drawn from a finite
range according to an arbitrary discrete probability distribution. Although the formula can
be easily computed numerically, a direct analysis is made difficult by the presence of the
probability distribution. In the case when only two distinct values can be assigned to the
terminal nodes, it is shown that, by choosing appropriately their probability distribution,
the number of terminal nodes examined by the α-β procedure can grow at least as
$O[(n/\ln n)^d]$, which, in fact, corresponds to the worst case behavior of the algorithm
(over all possible probability distributions).

A formula was then presented in the form of an integral to measure the number of
terminal nodes explored by the α-β procedure when the bottom values are all distinct. An
analysis of the integral shows that the branching factor of the α-β pruning algorithm is
$\Theta(n/\ln n)$, a result which confirms a claim by Knuth and Moore [35] that deep cut-offs
only have a second order effect on the behavior of the α-β pruning algorithm.
We think that the main contribution of this analysis is to give a better understanding
of the α-β pruning algorithm. In particular, we have shown that the a priori unrealistic
assumption that all the values assigned to the terminal nodes of a game tree be distinct
corresponds, in fact, to the worst case performance of the algorithm. Moreover, we have
shown that this worst case performance can be attained even in the very simple case when
the bottom values can only take on two distinct values, by choosing appropriately their
probability distribution. We think that this can be important in practice because it is
relatively easy in most game playing programs to obtain (by inspection of the evaluation
function) an accurate bound for the range of distinct values assigned to the various
positions of the game, but it is usually not so easy to derive a good estimate for the
probability distribution of these values.

Similarly, the branching factor analyzed in Section 5 provides us only with an
asymptotic measure of performance for the α-β pruning algorithm (i.e., for trees of large
depth). As indicated by the results of Section 3.3, however, the branching factor can also
be used as a realistic measure of the worst case even for small trees.

We have measured the efficiency of the α-β pruning algorithm by the average
number of terminal nodes explored during the search. It would be interesting to also
obtain an estimate for the standard deviation of this number.

The  scheme  we  have  considered  for  assigning  values  to  terminal  nodes  of  a uniform 
tree  lent  itself  easily  to  analysis;  it  is,  however,  very  simplistic.  Different  schemes  for 
assigning  static  values  have  been  proposed  in  [23],  [35]  and  [^5].  Analyses  of  these 
schemes  would  be  helpful  for  various  applications;  a step  in  this  direction  was  presented 
in [45] for game trees of depth 2 and 3.

In the second part of this chapter we have investigated the possibilities of
implementing the α-β pruning algorithm in parallel. Due to the intrinsically sequential
character of the algorithm, it seems difficult to achieve a high efficiency with a parallel
implementation based on a direct reformulation of the original algorithm. Rather than
having the processes search in parallel various subtrees of a game tree for the solution,
we have proposed, in Section 6, a parallel implementation in which the processes work
independently by searching the entire game tree for the solution over disjoint
subintervals. The idea is similar to the notion of aspiration level implemented
(sequentially) in the Technology Chess Program [24], [25].

In Section 7, we have developed an analysis of our parallel implementation of the
α-β pruning algorithm, and Theorem 7.4 states an optimal sequence of intervals (which
depends on the degree k of parallelism, i.e., the number of processes cooperating in the
search) for minimizing the average cost of the algorithm. It follows, in particular, that,
when the degree of parallelism k is small (k = 2 or 3), the parallel algorithm shows an
improvement over the original algorithm by a factor which is larger than k. A surprising
consequence of the results, therefore, is that the α-β pruning algorithm is not optimal.
This fact has been confirmed through a series of simulations, and for a typical tree (with a
degree of about 30, and a depth of about 5) the results show that the α-β pruning
algorithm can be improved by 15% to 25%. It is to be noted that these figures are very
consistent with empirical measurements of the Technology Chess Program [25] showing
that the implementation of the aspiration level reduces the search by 23%.

The analysis we have developed relies on several simplifying assumptions, and it
would be interesting to develop a more accurate analysis, for example, by using a closer
approximation for the cost of a partial search, or by evaluating the cost of a partial search
over some interval (a, b) given that the solution lies in some interval (a', b'). The analysis
could also be refined by not assuming that the processes cooperating in the search restart
new partial searches simultaneously.

Although the parallel implementation we have proposed appears to be efficient with
a small number of processes, the maximum speed-up achievable is limited typically to 5 or
6 (see Figure 7.3 with $k = \infty$). We feel that a better way to implement in parallel the
α-β pruning algorithm with a large number of processes would be to combine both the
strategy of decomposition we have proposed and the independent exploration of different
subtrees of the entire game tree. For example, we could have two groups of processes,
each group executing a partial search over a different subinterval, and each process in a
group exploring a different subtree. We think, however, that the results are very
important and should be used systematically in a sequential implementation, in conjunction
with some dynamic evaluation of the probability distribution of the value of a game tree.


Chapter  V 


Experimental  Results 
with  Asynchronous  Multiprocessors 


1 - Introduction 

By simulating a multiprocessor system, Rosenfeld [52] and Rosenfeld and
Driscoll [53] have reported a series of results to measure the effectiveness of
programming an asynchronous multiprocessor for the solution of the Dirichlet problem
using chaotic iterations [11]. The problem consists of solving the set of linear equations
associated with Laplace's equation through the method of finite differences.

In this chapter, we describe a series of experiments in which various asynchronous
iterative methods (see Chapter III) are implemented on an asynchronous multiprocessor
(C.mmp under the operating system Hydra [63], [64]) to solve the Dirichlet problem. We
first present the results of measurements obtained with these experiments. We then show
how very simple techniques from order statistics (see, for example, [14]) and from
queueing theory (see, for example, [33]) can be used effectively to explain and predict
with a fair accuracy the experimental results.

In Section 2, we briefly describe C.mmp and Hydra, and we outline the solution of
the Dirichlet problem. In Section 3, we introduce the various asynchronous iterative
methods that we have implemented on C.mmp. In Section 4, we report the results of the
experiments, and, in Section 5, we present simple analytical techniques to account for
these experimental results. Concluding remarks are given in the last section.
2 - Description of the experiments

In Section 2.1, we only present the main characteristics of C.mmp and of Hydra
which are relevant to our purpose here; a formal presentation of C.mmp is given in [63]
and of Hydra in [64]. Likewise, a full treatment of the use of the method of finite
differences for solving the Dirichlet problem can be found, for example, in [22], and we
only briefly describe the method in Section 2.2.

2.1  - The  environment 

The following description corresponds to a very simplified version of C.mmp under
the operating system Hydra but will be sufficient to provide a reasonable model for our
experiments.

C.mmp is a multiprocessor composed of p processors (p is currently 16, but, at the
time the experiments were run, it was oscillating between 4 and 9); $p_1$ of those processors
are PDP-11 model 20 and $p_2 = p - p_1$ are PDP-11 model 40. For purpose of comparison,
we will indicate with the results the number and type of processors used in the
experiments. Those processors are connected to m memory blocks (each with 1M words)
through an $m \times p$ cross-point switch; m is currently 16 (it was 13 at the time of the
experiments), but, since we are not limited by the size of the memory in our experiments,
the exact value of m is irrelevant here. In addition, each processor is also connected to
its own local memory (4K words). Although the memory available is very large, because of
the small address field of an instruction (16 bits), only a small fraction (32K words) is
directly addressable by a process at a given time. The Hydra system, however, provides
the user with the facility of modifying the address registers in order to access the entire
memory.

The Hydra system also provides the user with a set of macro-instructions for the
manipulation of processes (creation, synchronization, etc.). In addition, the policy module
ensures some critical functions of the system (process scheduling, processor allocation,
etc.); in particular, it ensures that each active process receives its fair share of processor
time, and a processor is allocated to a process only for some fixed quantum of time: at the
end of a quantum the processor is deallocated from the process, and the latter is put back
for re-scheduling into the pool of processes waiting for a processor.

2.2  - The  problem 

We consider a well-known problem, namely, the so-called Dirichlet problem for
Laplace's equation (see, for example, [22, Section 20.9]).

The problem is to solve the partial differential equation:

    $$ u_{xx} + u_{yy} = 0 \tag{2.1} $$

in a rectangular domain D of $\mathbb{R}^2$:
$D = \{\, (x,y) \mid 0 \le x \le \alpha ,\ 0 \le y \le \beta \,\}$, when values of u on
the boundary S of D are specified by the condition:

    $$ u = g , \tag{2.2} $$

for some given function g defined on S. Many applications require solving this partial
differential equation (or very similar ones) [22].

An approximation to the solution of equation (2.1) can be obtained through the
method of finite differences. Assume that $\alpha = (n+1)h$ and $\beta = (m+1)h$, and define
a regular grid on the domain D with mesh size h. This induces the set of points
$\{\, M_{i,j} = (x_i = ih ,\ y_j = jh) \mid 0 \le i \le n+1 ,\ 0 \le j \le m+1 \,\}$. Let
$u_{i,j}$ denote the value $u(x_i, y_j)$; the values $u_{0,j}$, $u_{n+1,j}$, $u_{i,0}$ and
$u_{i,m+1}$, on the boundary S, are known from equation (2.2). Using, for the
second order derivative at the point (x,y), the approximation

    $$ u_{xx}(x,y) \approx [\, u(x+h,y) + u(x-h,y) - 2u(x,y) \,] / h^2 $$

and a similar approximation for $u_{yy}(x,y)$, it can be shown (see, for example,
[22, Section 23.4]) that a solution to the set of linear equations:

    $$ 4u_{i,j} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1} = 0 , \quad 1 \le i \le n ,\ 1 \le j \le m , \tag{2.3} $$

gives an approximation to the solution of equation (2.1) for the points $M_{i,j}$ within an
error of order $h^2$ (assuming bounded properties of the fourth order derivatives of the
solution u). A piecewise linear approximation for the solution u on the domain D can then
be deduced from the solution of system (2.3).
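
To make the discretization concrete, the following Python sketch samples a known harmonic
function on the grid and evaluates the five-point expression
4 u_{i,j} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1} at every interior point. The
five-point form used here is the standard one and is an assumption about the exact layout
of system (2.3); the function names and the test function x^2 - y^2 are chosen for
illustration only.

```python
def grid_values(g, n, m, h):
    """Sample g at the grid points (i*h, j*h), 0 <= i <= n+1, 0 <= j <= m+1."""
    return [[g(i * h, j * h) for j in range(m + 2)] for i in range(n + 2)]


def residual(u, i, j):
    """Five-point finite-difference expression at the interior point (i, j)."""
    return 4 * u[i][j] - u[i - 1][j] - u[i + 1][j] - u[i][j - 1] - u[i][j + 1]


def harmonic(x, y):
    """x^2 - y^2 satisfies Laplace's equation u_xx + u_yy = 0."""
    return x * x - y * y


n, m, h = 4, 3, 0.25
u = grid_values(harmonic, n, m, h)
max_residual = max(abs(residual(u, i, j))
                   for i in range(1, n + 1) for j in range(1, m + 1))
print(max_residual)
```

For a quadratic harmonic function the second differences are exact, so the residual
vanishes; for a general solution it is only small, in line with the error estimate above.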

The set of equations (2.3) constitutes a linear system for which we are investigating
the solution. This system can be written, in matrix form, as:

    $$ A x = a . \tag{2.4} $$

When x is the nm-vector corresponding to the row-major ordering of the grid points:

    $$ x = (u_{1,1}, u_{1,2}, \ldots, u_{n,m})^T , $$

we deduce from this ordering the $nm \times nm$ matrix A and the nm-vector a of equation
(2.4), the latter being known from the values of the function g giving the boundary
conditions.

Different iterative schemes have been implemented on C.mmp to solve this system.
They are described in the following section.

3 - Some  implementations  of  asynchronous  iterations 

The matrix A of equation (2.4) is a very sparse matrix (at most five elements are not
zero in any given row), and, in this case, iterative methods, although they do not provide
us with the exact solution, are usually advantageous.

The first two methods we have considered are two basic iterative methods: the
point Jacobi and the Gauss-Seidel methods. These two methods have been widely
studied and will be useful as a basis of comparison. These and other iterative methods
that we have implemented are described in the following sections. Throughout, we discuss
parallel implementations with k processes (k = 1 corresponding to a sequential
implementation), and, for simplicity, we assume that the size nm of the matrix A is a
multiple of k and let q = nm/k. In all implementations, we make use of a global vector,
called X, to contain the current value of the solution vector.
3.1 - Jacobi's method and Asynchronous Jacobi's method

Since all diagonal elements of the matrix A have the same value of 4, the point
Jacobi matrix is readily obtained. Let x(i) denote the i-th iterate computed by Jacobi's
method. We simply deduce from equation (2.4) that:

    $$ x(i+1) = (I - \tfrac{1}{4} A)\, x(i) + \tfrac{1}{4}\, a = B\, x(i) + b . $$

The matrix

    $$ B = I - \tfrac{1}{4} A $$

is the Jacobi matrix associated with our problem. This matrix has been extensively
studied, and its spectral radius, which determines the rate of convergence of Jacobi's
method, is given by:

    $$ \rho(B) = \frac{1}{2} \Big( \cos \frac{\pi}{n+1} + \cos \frac{\pi}{m+1} \Big) . \tag{3.1} $$

We see that with Jacobi's method all components of an iterate are computed
simultaneously using the values of the previous iterate, and that parallelism can,
therefore, be introduced easily. A natural parallel implementation with k processes is to
simply decompose the evaluation of an iterate into k subcomputations, each one
corresponding to the evaluation of a subset of q = nm/k components, and to have the k
processes carry out the evaluation of the k subsets of components in parallel. When a
process completes its computation, it must then block itself and wait until the completion
of all other subcomputations before starting the evaluation of the next iterate. Our
implementation corresponds to this description, in which process P_1 always evaluates the
first q components of the iterate, process P_2 the next q components, ... and process P_k the
last q components. After each subcomputation all processes synchronize themselves using
a semaphore, and, after having updated the components, they all resume their executions
for the evaluation of the next iterate.
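The synchronized scheme described above can be sketched with Python threads. This is an illustration under stated assumptions, not the BLISS-11 program: it assumes the five-point form of B (each component is one quarter of the sum of its grid neighbours), takes b = 0 and an initial vector of ones, and replaces the semaphore of the original implementation by a `threading.Barrier`. The names `neighbours` and `parallel_jacobi` are hypothetical.

```python
import threading


def neighbours(p, n, m):
    """Row-major indices of the grid neighbours of point p (0-based)."""
    r, c = divmod(p, m)
    candidates = ((p - m, r > 0), (p + m, r < n - 1),
                  (p - 1, c > 0), (p + 1, c < m - 1))
    return [j for j, inside in candidates if inside]


def parallel_jacobi(n, m, k, eps):
    """Synchronous Jacobi with k threads, b = 0 and x(0) = (1, ..., 1)."""
    size = n * m
    q = size // k                     # components per process (k divides nm)
    X = [1.0] * size                  # global vector: the current iterate
    new = [0.0] * size                # components of the next iterate
    steps = [0]
    barrier = threading.Barrier(k)

    def process(i):
        lo, hi = i * q, (i + 1) * q
        # X is only written between the two barriers, so every thread sees
        # the same X here and takes the same decision to continue or stop.
        while max(abs(v) for v in X) >= eps:
            for p in range(lo, hi):
                new[p] = 0.25 * sum(X[j] for j in neighbours(p, n, m))
            barrier.wait()            # wait for all other subcomputations
            X[lo:hi] = new[lo:hi]     # update the components of subset i
            if i == 0:
                steps[0] += 1         # one more complete iteration step
            barrier.wait()            # all updates done before the next step

    threads = [threading.Thread(target=process, args=(i,)) for i in range(k)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return X, steps[0]
```

Because every process waits at both barriers, the iterates do not depend on k; only the waiting time does, which is the overhead discussed next.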


The complete synchronization of all processes at each step of the iteration is an
evident drawback in the parallel implementation of Jacobi's method, and we can anticipate
that this will result in a substantial overhead. The Asynchronous Jacobi's method (or AJ
method) is a variation of Jacobi's method in which a process never waits for the other
processes to complete their computations. As soon as a process completes the evaluation
of its subset of components, it releases the new values for the other processes by
updating the corresponding components of the global vector X, and, immediately after, the
process starts re-evaluating its subset, using in the computation the values of the
components as they are known at the beginning of the re-evaluation. The AJ method has
been implemented using a critical section for updating the components of the global vector
X at the end of an evaluation, and for copying the components of X required for the next
evaluation.
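A minimal sketch of the AJ cycle follows, again with Python threads and under the same assumptions as before (five-point form of B, b = 0, initial vector of ones). The critical section is a `threading.Lock`; each thread copies the whole of X rather than only the components it needs, and tests termination on its private copy, which is one possible reading of "global error" and not necessarily what the original program did. All names are hypothetical.

```python
import threading


def neighbours(p, n, m):
    """Row-major indices of the grid neighbours of point p (0-based)."""
    r, c = divmod(p, m)
    candidates = ((p - m, r > 0), (p + m, r < n - 1),
                  (p - 1, c > 0), (p + 1, c < m - 1))
    return [j for j, inside in candidates if inside]


def asynchronous_jacobi(n, m, k, eps):
    """AJ method with k threads; returns X and the evaluations per thread."""
    size = n * m
    q = size // k                      # components per process (k divides nm)
    X = [1.0] * size                   # global vector, shared by all threads
    mutex = threading.Lock()           # protects the critical section
    evaluations = [0] * k

    def process(i):
        lo, hi = i * q, (i + 1) * q
        with mutex:
            local = list(X)            # values as of the start of the evaluation
        while max(abs(v) for v in local) >= eps:
            # Evaluation section: uses only the private copy, never waits.
            new = [0.25 * sum(local[j] for j in neighbours(p, n, m))
                   for p in range(lo, hi)]
            evaluations[i] += 1
            with mutex:                # critical section
                X[lo:hi] = new         # release the new values
                local = list(X)        # copy the values for the next evaluation

    threads = [threading.Thread(target=process, args=(i,)) for i in range(k)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return X, evaluations
```

Unlike the synchronized version, the number of evaluations carried out by each thread depends on the scheduling and may differ from one run to the next.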

It can be seen easily that, if a process is never suspended indefinitely, the AJ
method can be expressed as an asynchronous iterative method relative to the linear
operator corresponding to the Jacobi matrix B. Since B is a non-negative matrix with a
spectral radius less than unity, it is a contracting matrix, and the convergence of the AJ
method for our problem is a direct consequence of the results of Chapter III.

3.2 - Gauss-Seidel's method and Asynchronous Gauss-Seidel's method


Gauss-Seidel's method differs from Jacobi's method in that the components of an
iterate are evaluated in sequence and the value of x_r(i) is used in the computation of x_s(i)
when s > r (that is, as soon as it is available). Let L and U be the strictly lower and upper
triangular matrices defined from:

$$B = I - \tfrac{1}{4} A = L + U .$$

The sequence of iterates, for Gauss-Seidel's method, satisfies:

$$x(i+1) = L\, x(i+1) + U\, x(i) + b .$$

The matrix

$$\mathcal{L} = (I - L)^{-1} U$$

defines x(i+1) directly as a function of x(i). Its spectral radius determines the rate of
convergence of Gauss-Seidel's method and is given by:

$$\rho(\mathcal{L}) = [\rho(B)]^2 , \qquad (3.2)$$

where $\rho(B)$ is the spectral radius of the Jacobi matrix and is given by equation (3.1).

We notice that Gauss-Seidel's method is intrinsically sequential, and that parallelism
cannot be easily introduced. The method has been implemented sequentially (i.e., with 1
process) as a particular case of the Asynchronous Gauss-Seidel's method.

The Asynchronous Gauss-Seidel's method (or AGS method) is similar to the AJ method
except that a process evaluates the components in its subset sequentially and uses the
new value of a component within the same subset as soon as it becomes available. In this
respect, the AGS method resembles Gauss-Seidel's method for the computation within a
subset of components, and, in particular, when the AGS method is implemented with only one
process, it simply reduces to Gauss-Seidel's method.
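The only difference between the two evaluation rules is whether a sweep reads the values of the previous iterate or the values already recomputed during the same sweep. The sequential sketch below (one process, so the second variant is plain Gauss-Seidel) makes this explicit; it assumes the five-point form of B, b = 0 and an initial vector of ones, and the grid size 6 × 8 and all names are illustrative.

```python
def neighbours(p, n, m):
    """Row-major indices of the grid neighbours of point p (0-based)."""
    r, c = divmod(p, m)
    candidates = ((p - m, r > 0), (p + m, r < n - 1),
                  (p - 1, c > 0), (p + 1, c < m - 1))
    return [j for j, inside in candidates if inside]


def sweeps_needed(n, m, eps, use_new_values):
    """Number of sweeps until the maximum norm of the iterate is below eps."""
    size = n * m
    x = [1.0] * size
    sweeps = 0
    while max(abs(v) for v in x) >= eps:
        # Jacobi reads a frozen copy of the previous iterate; Gauss-Seidel
        # reads x itself and so sees components already updated in this sweep.
        source = x if use_new_values else list(x)
        for p in range(size):
            x[p] = 0.25 * sum(source[j] for j in neighbours(p, n, m))
        sweeps += 1
    return sweeps


jacobi_sweeps = sweeps_needed(6, 8, 0.1, use_new_values=False)
gauss_seidel_sweeps = sweeps_needed(6, 8, 0.1, use_new_values=True)
print(jacobi_sweeps, gauss_seidel_sweeps)
```

In the AGS method the same in-place rule is applied only inside the subset owned by a process, the components of the other subsets being read from the copy taken in the critical section.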

As in the case of the AJ method, the AGS method can be shown to correspond to an
asynchronous iterative method relative to the Jacobi matrix B, and, in this case too, the
convergence of the AGS method follows from the results of Chapter III since the matrix B
(in the particular case of our problem) is a contracting matrix.

3.3  - Purely  Asynchronous  iterative  method 

The Purely Asynchronous method (or PA method) is the simplest method we have
implemented. It basically resembles the AGS method, but it uses no critical section for
releasing the values of the components in its subset or for copying the values of the
components required in the computations. Rather, a process fetches directly from the
global vector X the values of the components as they are needed and releases new values
of the components one by one, immediately after the evaluation of each component. Again,
the PA method can be easily expressed as an asynchronous iteration relative to the linear
operator corresponding to the contracting matrix B, and the convergence of the PA method,
for our problem, follows directly from the results of Chapter III.
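For comparison with the previous methods, a PA process can be sketched as follows. The sketch assumes, as before, the five-point form of B, b = 0 and an initial vector of ones; the threads share the list X directly and use no lock at all. The termination test, which scans X without protection, is an assumption of this illustration, and the names are hypothetical.

```python
import threading


def neighbours(p, n, m):
    """Row-major indices of the grid neighbours of point p (0-based)."""
    r, c = divmod(p, m)
    candidates = ((p - m, r > 0), (p + m, r < n - 1),
                  (p - 1, c > 0), (p + 1, c < m - 1))
    return [j for j, inside in candidates if inside]


def purely_asynchronous(n, m, k, eps):
    """PA method with k threads and no synchronization on X."""
    size = n * m
    q = size // k                      # components per process (k divides nm)
    X = [1.0] * size                   # global vector, read and written freely

    def process(i):
        lo, hi = i * q, (i + 1) * q
        while max(abs(v) for v in X) >= eps:
            for p in range(lo, hi):
                # Fetch the neighbours directly from X as they are needed and
                # release the new value immediately after its evaluation.
                X[p] = 0.25 * sum(X[j] for j in neighbours(p, n, m))

    threads = [threading.Thread(target=process, args=(i,)) for i in range(k)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return X
```

No private copy of the iterate and no buffer of new values is needed, which is the space saving mentioned below. In Python a list element is replaced as a whole, so the double-word problem discussed later for the PDP-11 does not show up in this sketch.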

In addition to being the simplest method to implement from a programming point of
view, the PA method is also, spacewise, the most efficient method since no extra variable
is required to copy the values of an iterate as of the beginning of an evaluation or to
contain the new values of the components before being released. The main advantage of
the PA method, however, is the total absence of any form of synchronization, which,
therefore, makes it very attractive for implementation on an asynchronous multiprocessor.

An apparent disadvantage of the PA method is that all processes frequently access
the common global vector X, therefore possibly causing memory conflicts. This is not so
for the particular problem we are considering in the case of a large system of equations
(i.e., for large n and m). Because of the sparsity and the special form of the matrix associated
with our system, accesses to the vector X by a given process will be mostly confined to
accesses of components within its own subset, with only a few accesses to components in
the two adjacent subsets. Moreover, this is the general case for the solution of linear
systems resulting from the application of the method of finite differences to partial
differential equations. Therefore, this apparent problem can be solved easily simply by
allocating different memory banks to different subsets of components of the global vector
X.

Another problem with the PA method is specific to C.mmp (and Cm*) and is due to
the absence of uninterruptible double word instructions on the PDP-11 (or the LSI-11). In
particular, since a floating point number is implemented on two consecutive 16 bit words,
simultaneous updating and reading of the same component by two processes might result
in a loss of precision of the last 16 bits of the mantissa. Although this problem is very
unlikely to occur, it is real, and the precision achievable on the solution vector has to be
chosen accordingly.

3.4  - Other  possible  implementations 

The methods we have introduced are intended to be an illustration of the issues
raised by the implementation of parallel algorithms on an asynchronous multiprocessor,
and they are not necessarily the most efficient way to solve a linear system of equations
by iteration. In this section, we mention several techniques which should be used in the
practical implementation of asynchronous iterative methods.

3.4.1 - Asynchronous iterations with relaxation

The introduction of a relaxation factor is a well known technique for improving the
performance of iterative methods, and, although we do not report here any results
concerning iterative methods using relaxation, we have run some experiments which show
that the introduction of a relaxation factor is a very promising way to accelerate
asynchronous iterative methods.

Let F be an operator, and let $\omega$ be a positive scalar. An iteration relative to F with
the relaxation factor $\omega$ defines the sequence of iterates through:

$$x(i+1) = \omega F x(i) + (1-\omega)\, x(i) .$$

In particular, when $\omega = 1$, this corresponds directly to the iteration relative to F. This
technique is very useful, in general, since the relaxation factor $\omega$ can be chosen to
maximize the efficiency of the iteration.

As particular cases, let us examine the methods we have implemented. The Jacobi
Over-Relaxation method (or JOR method) produces the sequence of iterates defined by:

$$x(i+1) = \omega \left[ (I - \tfrac{1}{4} A)\, x(i) + \tfrac{1}{4} a \right] + (1-\omega)\, x(i) ,$$

and, therefore, corresponds to Jacobi's method with the Jacobi matrix:

$$B_\omega = I - \tfrac{\omega}{4} A = \omega B + (1-\omega) I .$$

It follows that, in our case,

$$\rho(B_\omega) = |1-\omega| + \omega\, \rho(B) ,$$

therefore, $\omega = 1$ minimizes $\rho(B_\omega)$, which means that Jacobi's method cannot be improved
using relaxation.
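The claim that no relaxation factor improves Jacobi's method can be checked numerically from the expression for $\rho(B_\omega)$. The sketch below scans a few candidate factors; the value 0.99097 for $\rho(B)$ is the one quoted in Section 4.1.2 for n = 21 and m = 24, while the function name and the candidate grid are illustrative.

```python
def rho_relaxed(omega, rho_B):
    """Spectral radius |1 - omega| + omega * rho(B) of the JOR matrix."""
    return abs(1.0 - omega) + omega * rho_B


rho_B = 0.99097                                   # value quoted in Section 4.1.2
candidates = [i / 20 for i in range(10, 31)]      # omega = 0.50, 0.55, ..., 1.50
best = min(candidates, key=lambda w: rho_relaxed(w, rho_B))
print(best, rho_relaxed(best, rho_B))
```

For $\omega < 1$ the expression equals $1 - \omega (1 - \rho(B))$, which decreases with $\omega$, and for $\omega > 1$ it equals $\omega (1 + \rho(B)) - 1$, which increases, so the minimum is at $\omega = 1$.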

The Successive Over-Relaxation method (or SOR method) is derived from
Gauss-Seidel's method. The SOR method defines the sequence of iterates:

$$x(i+1) = \omega \left[ L\, x(i+1) + U\, x(i) + b \right] + (1-\omega)\, x(i) ,$$

and it can be shown (see, for example, [62, p. 203]) that the spectral radius of the SOR
matrix

$$\mathcal{L}_\omega = (I - \omega L)^{-1} \left[ (1-\omega) I + \omega U \right]$$

is minimized when:

$$\omega = \frac{2}{1 + \sqrt{1 - \rho^2(B)}} .$$

Similarly we can define the AJOR, ASOR and PAOR methods from the AJ, AGS and PA
methods, respectively. All three methods are easily shown to correspond to asynchronous
iterative methods relative to the linear operator associated with the matrix $B_\omega$. In
particular, since

$$\rho(|B_\omega|) = |1-\omega| + \omega\, \rho(B) ,$$

provided that:

$$0 < \omega < \frac{2}{1 + \rho(B)} , \qquad (3.3)$$

the matrix $B_\omega$ is a contracting matrix, and we are guaranteed of the convergence of all
three methods in the particular case of our problem. Nothing, however, is known in
general as to the best $\omega$, and further results in this direction would certainly be of
interest. Note that condition (3.3) only represents a sufficient condition for convergence,
and that the methods can still converge outside of this range.
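To give an idea of the orders of magnitude involved, the following sketch evaluates the optimal factor of the sequential SOR method and the upper bound of condition (3.3). It uses the value $\rho(B) = 0.99097$ quoted in Section 4.1.2; the function names are illustrative, and nothing here says which factor is best for the asynchronous methods.

```python
import math


def optimal_sor_factor(rho_B):
    """Factor minimizing the spectral radius of the sequential SOR matrix."""
    return 2.0 / (1.0 + math.sqrt(1.0 - rho_B ** 2))


def sufficient_upper_bound(rho_B):
    """Upper bound on omega in the sufficient condition (3.3)."""
    return 2.0 / (1.0 + rho_B)


def is_contracting(omega, rho_B):
    """True when |1 - omega| + omega * rho(B) < 1, i.e. (3.3) holds."""
    return omega > 0.0 and abs(1.0 - omega) + omega * rho_B < 1.0


rho_B = 0.99097                        # value quoted in Section 4.1.2
print(optimal_sor_factor(rho_B))       # factor for sequential SOR
print(sufficient_upper_bound(rho_B))   # bound guaranteeing a contraction
```

With a spectral radius this close to 1, the range covered by (3.3) extends only slightly beyond $\omega = 1$, which is why the remark that convergence may still occur outside this range matters in practice.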


3.4.2 - Adaptive asynchronous iterations

All of the implementations that we have proposed are based on a static
decomposition of the computation involved in the evaluation of an iterate, and, in all cases,
each process is assigned to the evaluation of a fixed subset of components. With Jacobi's
method, this results in a substantial overhead since all processes have to wait for each
other at the end of each step of the iteration. A possibility for reducing this overhead is
to decompose the components of an iterate into more subsets than processes, and to let
the processes adjust their own speeds by evaluating more or fewer subsets of
components. For example, the parallel implementation of Jacobi's method with 2 processes
which seems the best suited for execution on an asynchronous multiprocessor is to have
one process update the components starting with the first one and to have the second
process update the components starting with the last one; an iteration step terminates
when the two processes meet (not necessarily exactly in the middle). With this
implementation, the difference in execution times between the two processes is limited at
most to the time to evaluate only one component, which obviously reduces significantly
the waiting time.
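One step of this two-process scheme can be sketched as follows. The text does not say how the two processes detect that they have met; the sketch assumes a pair of shared indices protected by a lock, which is only one possible choice. The names `two_ended_step` and `evaluate` are hypothetical, and `evaluate(x, p)` stands for the computation of component p of the next iterate from the previous iterate x.

```python
import threading


def two_ended_step(x, evaluate):
    """One Jacobi step shared by two threads working towards each other."""
    size = len(x)
    new = [None] * size
    bounds = [0, size - 1]           # next free index from the front / the back
    lock = threading.Lock()
    counts = [0, 0]                  # components evaluated by each thread

    def claim(from_front):
        """Reserve the next component, or return None once the threads met."""
        with lock:
            lo, hi = bounds
            if lo > hi:
                return None
            if from_front:
                bounds[0] = lo + 1
                return lo
            bounds[1] = hi - 1
            return hi

    def process(i):
        while True:
            p = claim(from_front=(i == 0))
            if p is None:
                return               # the two processes have met
            new[p] = evaluate(x, p)  # reads only the previous iterate x
            counts[i] += 1

    threads = [threading.Thread(target=process, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return new, counts


x = [float(v) for v in range(10)]
new, counts = two_ended_step(x, lambda old, p: 2.0 * old[p])
print(counts)                        # how the work was split on this run
```

Every component is evaluated exactly once per step, whatever the relative speeds of the two threads; only the split recorded in `counts` changes.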

Another way to take into account the different speeds of the processes would be to
subdivide  the  components  into  subsets  of  different  sizes,  and  assign  the  computation  of  a 
larger  subset  of  components  to  a faster  process.  The  speed  of  a process,  however, 
depends  mainly  on  the  speed  of  the  processor  on  which  the  system  decides  to  execute  the 
process,  and  this  is  usually  not  known  a priori. 

There  is  another  advantage  of  not  pre-assigning  to  a process  the  evaluation  of  a 
fixed  subset  of  components  since,  at  each  step  of  the  iteration  this  allows  for  some 
flexibility  in  the  selection  of  the  subset  to  be  evaluated  next.  Many  criteria  can  be  used 
for  this  selection,  in  particular: 

(1)  LRU:  the  subset  selected  is  the  one  which  has  been  the  Least  Recently 
Updated  among  those  not  currently  updated. 

(2)  GRE:  the  subset  selected  is  the  one  which  carries  the  Greatest  Relative  Error 
(also  among  those  which  are  not  currently  updated). 
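The two criteria can be stated as small selection functions. The sketch assumes that a time stamp of the last update and an estimate of the relative error are maintained for each subset; how these quantities would be kept up to date is not described in the text, and all names are hypothetical.

```python
def select_lru(last_update, busy):
    """LRU: least recently updated subset among those not currently updated."""
    free = [s for s in range(len(last_update)) if s not in busy]
    return min(free, key=lambda s: last_update[s])


def select_gre(relative_error, busy):
    """GRE: subset with the greatest relative error among the free subsets."""
    free = [s for s in range(len(relative_error)) if s not in busy]
    return max(free, key=lambda s: relative_error[s])


# Four subsets; subset 3 is currently being updated by another process.
print(select_lru([5, 2, 9, 1], busy={3}))            # subset 1 is selected
print(select_gre([0.1, 0.4, 0.3, 0.9], busy={3}))    # subset 1 is selected
```

In a parallel program the selection and the marking of the chosen subset as busy would have to be done atomically.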

The GRE selection, for instance, should increase the efficiency of an iterative method by
reducing the number of iterations required to achieve some given admissible error. The
selection of a new subset at each step of the iteration might, however, introduce
additional overhead and, in particular, will almost necessarily require the use of a critical
section. We do not think, therefore, that this should be used in conjunction with the PA
method.

3.5 - Organization of the program

Before presenting the results we give a brief description of the programs. All of
the different methods have been implemented in BLISS-11 [15] and all programs have
basically the same following structure.


Master process:

    Initialization: read in n, m, ε, k;
    for i = 1, k do
        Create and start process i;
    for i = 1, k do
        P(completion);
    Output the statistics about the run;

Computational process i:

    P(mutex);
        Read all necessary components of X;
    V(mutex);
    repeat
        Evaluate all components of subset i;
        P(mutex);
            Update all components in subset i;
            Read all necessary components of X;
        V(mutex);
    until global error < ε;
    V(completion);

The method implemented by this program is embedded in the instruction "Evaluate
all components of subset i." From the program each process can be thought of as a
succession of identical cycles, each cycle being composed of an evaluation section followed
by a critical section.

The programs for Jacobi's method and for the PA method are slightly different but
follow basically the same structure.


4 - The results of the experiments

We report, in this section, the measurements obtained by running on C.mmp the
various iterative methods that we have introduced in Section 3. We discuss, in
Section 4.1, the different parameters of the program and the decisions leading to their
choices. In Section 4.2, we present the local behavior of the processes within each cycle,
and, in Section 4.3, we present the global results and compare the different methods.

4.1 - Choice of the parameters

All of the experiments have been run under the same conditions, and, before
presenting the results of the measurements, we briefly discuss below the choices we have
made for the various parameters of our problem.

4.1.1 - Size of the system

We want to choose the size of the system to be solved (i.e., to choose n and m)
large enough so that the problem is realistic, but, on the other hand, since we do not
want to deal here with problems of memory addressing, we have limited ourselves to a
size that permits all of the data to be directly addressable. The main restriction, in this
case, comes from the fact that the size of the data local to a computational process has to
fit into the stack of local variables (contained in page 0), i.e., in about 3K words. With
the AJ method, for instance, each process has to have the values of the components it is
updating and a copy of the values of the components used in the evaluation, as of the
starting time of the computation. There may be up to 2nm elements each of which fits into
two words of memory. Therefore nm has to be chosen below 700. The number 504 has
been chosen (mainly because it is divisible by 1, 2, 3, 4, 6, 7, 8, 9 ... and almost by 5 too!),
and n and m have been chosen to be 21 and 24, respectively, in the series of experiments
reported here.
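The arithmetic behind this choice is easy to reproduce. In the sketch below, "about 3K words" is taken to mean 3072 words, which is an assumption; the margin between the resulting bound and the limit of 700 quoted above presumably leaves room for the other local variables. The variable names are illustrative.

```python
stack_words = 3 * 1024           # "about 3K words", taken as 3072 here
words_per_element = 2            # one floating point number = two 16 bit words
copies = 2                       # new values plus a copy of the values used

# Largest nm such that 2nm elements of two words each fit in the stack.
upper_bound = stack_words // (words_per_element * copies)

nm = 21 * 24
process_counts = [k for k in range(1, 15) if nm % k == 0]

print(upper_bound, nm)
print(process_counts)            # numbers of processes that divide nm evenly
print(nm % 5)                    # remainder showing "almost divisible by 5"
```

The divisors obtained in this way are exactly the numbers of processes used in the experiments of Section 4.3.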


4.1.2 - Error of the solution vector

An experiment is stopped when some norm of the error vector is smaller, in
magnitude, than a given admissible error $\varepsilon$. (The norm we have chosen is $\|\cdot\|_\infty$, the
maximum over all components.) Since we want to be able to compare the experimental
results with the results of a theoretical analysis, we want to choose $\varepsilon$ small enough so that
asymptotic rates of convergence can be estimated through experimental results. For our
purposes, the asymptotic rate of convergence for a method $\mathcal{M}$ can be defined as:

$$R(\mathcal{M}) = \lim_{i \to \infty} \left[ -\frac{1}{n_i} \log \|e_i\| \right] , \qquad (4.1)$$

where $e_i$ is the error vector after the i-th sub-iteration (a sub-iteration corresponds to an
evaluation by one process so that k sub-iterations are carried out simultaneously in a
parallel implementation with k processes), and where $n_i$ is the mean number of times each
component has been evaluated up to the i-th sub-iteration. For all the implementations
we have considered the components are divided into k equal subsets, and $n_i$ is simply
given by $n_i = i/k$. (The norm in equation (4.1) is the same norm as the one used in the
termination criterion.) This definition of asymptotic rate of convergence corresponds to
the classical definition and, in particular, we have $R(\mathrm{Jacobi}) = -\log \rho(B)$.

The interpretation of the rate of convergence is that $1/R(\mathcal{M})$ is an asymptotic
measure of the average number of times each component has to be updated in order to
decrease the norm of the error vector by a factor of 10 (if the log of equation (4.1) is base
10). In particular, when $\varepsilon$ tends to 0, the average number of iterations (per component)
required to solve the system with an error less than $\varepsilon$ grows linearly like $-\log(\varepsilon)/R(\mathcal{M})$.
In Figure 4.1 we have plotted the number, $N(\varepsilon)$, of iterations required to solve our system
(n = 21, m = 24) within an error $\varepsilon$, versus $-\log(\varepsilon)$ for both the AJ and the AGS methods
when k = 1 and 3 processes are used. This shows clearly that the asymptotic rate of
convergence is reached very fast since, when $-\log(\varepsilon) > 0.25$ (i.e., $\varepsilon < 0.56$), $N(\varepsilon)$ varies
linearly with $-\log(\varepsilon)$.

When k = 1 the AJ and AGS methods reduce to Jacobi's and Gauss-Seidel's methods,
respectively, and the slopes obtained from Figure 4.1 can be compared to the theoretical
values $[-\log \rho(B)]^{-1}$ and $[-\log \rho(\mathcal{L})]^{-1}$, respectively, where:

$$\rho(B) = \tfrac{1}{2} \left( \cos\frac{\pi}{22} + \cos\frac{\pi}{25} \right) \approx 0.99097 ,$$

$$\rho(\mathcal{L}) = [\rho(B)]^2 \approx 0.98202 .$$

In Table 4.1, we report the observed and theoretical number of iterations required to
asymptotically divide the norm of the error vector by a factor of 10.

[Figure 4.1: curves for AJ (k = 3), AJ (k = 1), AGS (k = 3) and AGS (k = 1)]

Figure 4.1 - Number of iterations required with the AJ and AGS methods

                       AJ                  AGS
                  k = 1   k = 3       k = 1   k = 3
    Observed:      254     257         127     143
    Theoretical:      254.79              127.89

Table 4.1 - Comparison of the rates of convergence for the AJ and AGS methods
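The comparison between observed and theoretical slopes can be repeated for the sequential Jacobi case. The sketch below evaluates $\rho(B)$ from equation (3.1) and counts the Jacobi iterations needed to go from an error of 0.1 to an error of 0.01, that is, to divide the error by 10 in the asymptotic regime. It assumes the five-point form of B, b = 0 and an initial vector of ones, as in the experiments; it is not the original program, and the names are illustrative.

```python
import math


def rho_jacobi(n, m):
    """Spectral radius of the Jacobi matrix, equation (3.1)."""
    return 0.5 * (math.cos(math.pi / (n + 1)) + math.cos(math.pi / (m + 1)))


def iterations_to_reach(n, m, thresholds):
    """Jacobi iteration counts N(eps) for decreasing thresholds eps."""
    size = n * m
    nbrs = []
    for p in range(size):
        r, c = divmod(p, m)
        cand = ((p - m, r > 0), (p + m, r < n - 1),
                (p - 1, c > 0), (p + 1, c < m - 1))
        nbrs.append([j for j, inside in cand if inside])
    x = [1.0] * size                 # initial approximation; solution is 0
    counts, i = [], 0
    for eps in thresholds:
        # All components stay non-negative, so max(x) is the maximum norm.
        while max(x) >= eps:
            x = [0.25 * sum(x[j] for j in nb) for nb in nbrs]
            i += 1
        counts.append(i)
    return counts


n, m = 21, 24
rho_B = rho_jacobi(n, m)
rho_GS = rho_B ** 2                  # equation (3.2)
n_01, n_001 = iterations_to_reach(n, m, [0.1, 0.01])
observed = n_001 - n_01              # iterations to divide the error by 10
predicted = 1.0 / -math.log10(rho_B)
print(rho_B, rho_GS, observed, predicted)
```

The same measurement with the in-place (Gauss-Seidel) update rule would give the corresponding figure for the AGS method with k = 1.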

In all the experiments reported below, the termination criterion uses $\varepsilon = 0.1$ for the
value of the admissible error. This value corresponds to a reasonable execution time, in
the order of 3 min., and allows us to base our measurements on more experiments.

4.1.3 - Other parameters

Since we are mainly interested in comparing the different methods with respect to
their rates of convergence toward the solution vector, we simply set the displacement
vector b to be 0 so that the solution is known to be the zero vector. As the system we are
studying is linear, we do not lose any generality, but this will result in a simpler test for the
termination criterion since, in this case, the current iterate is exactly the error vector.
Lastly, in all the experiments, the initial approximation has been chosen as the vector with
all components equal to 1.

4.2  - Local  behavior  of  the  program 

We present, in this section, the local behavior of the computational processes by
looking at the time they spend during each cycle in the evaluation section and (except
with the PA method) in the critical section of the program. In Section 4.2.1, we present
the results of the measurements, and, in Section 4.2.2, we give an interpretation.

4.2.1 - Results of the measurements

The results presented in this section have been derived from the information given
by the tracer David Lamb implemented on C.mmp. (Among many other things, each P and V
operation is reported by the tracer along with the time instant when it was executed, the
process executing the operation and the processor carrying out the execution.) Since the
code of the programs for the different methods is identical (with respect to these
measurements) we limited ourselves to taking measurements on the AJ method. Four
experiments have been run with k = 1, 3, 6, and 12 processes. In all of them p = 7
processors were available: 5 PDP-11/20 and 2 PDP-11/40. The histograms for the
distribution of the time spent in the evaluation section as well as the distribution of the
time spent in the critical section, for each of the experiments, are plotted in Figures 4.2
through 4.9. (In the case of the critical section, the results presented in these figures also
include, when k > 1, the possible waiting time before entering the critical section.)

Figure 4.4 - Time spent in the evaluation section (k = 3)

Figure 4.5 - Time spent in the critical section (k = 3)

These figures show clearly that two different types of processors are used. When
k = 3, for example, the distributions have two main peaks (at about 18 ms. and 28 ms. in
Figure 4.5), and, in particular, we can derive from our results an estimate for the relative
speeds of the PDP-11/20 and the PDP-11/40. The ratio of the speeds is certainly
problem dependent but, in our case, 1 second on a PDP-11/40 corresponds to about
1.4 seconds on a PDP-11/20, i.e., the use of a PDP-11/40 instead of a PDP-11/20
corresponds to a gain of about 30% in running time. If we look more closely, we can see
that each main peak is composed of several subpeaks corresponding to each processor;
two different processors, even of the same type, actually have different speeds. This is
particularly evident in Figures 4.2 and 4.3, where the two main peaks correspond to the
executions on each of the 2 PDP-11/40. Since it is the policy of Hydra to allocate first
the PDP-11/40, the third peak in Figure 4.2 does not correspond to an execution on a
PDP-11/20 but, in fact, corresponds to executions on a PDP-11/40 which include some
overhead due to the re-scheduling of a process at the end of a quantum.

4.2.2 - An interpretation of the results

The main statistics about the distributions presented in the figures of Section 4.2.1
are collected in Table 4.2 (a) and (c) for the evaluation section and the critical section
(including the possible waiting time), respectively. In addition, Table 4.2 (b) contains the
same statistics concerning the critical section by itself, excluding any waiting time. (All
timings in the table are expressed in ms.)

In Figures 4.10, 4.11 and 4.12, we have plotted the variations of the average
execution times for the two sections of the program as they can be found in
Table 4.2 (a), (b) and (c), respectively. The results of Figure 4.11 represent strictly the
execution time of the critical section, while the timings presented in Figure 4.12 also
contain the possible waiting time before entering the critical section.

                      k = 1      k = 3      k = 6      k = 12
    Minimum          1123.85     348.30     239.36     100.07
    Maximum          1889.60    1524.13     834.97     502.02
    Average          1292.72     534.35     423.04     187.86
    Standard dev.     136.51     118.88      84.23      47.10
    Coeff. of var.     0.106      0.222      0.199      0.251

    (a) Evaluation section

                      k = 1      k = 3      k = 6      k = 12
    Minimum            43.49      16.82      13.59       7.44
    Maximum           174.82     186.02     170.96      21.91
    Average            47.75      23.96      21.65      11.57
    Standard dev.      13.91      11.71       7.67       2.77
    Coeff. of var.     0.291      0.488      0.354      0.240

    (b) Critical section (without the blocking)

                      k = 1      k = 3      k = 6      k = 12
    Minimum            43.49      16.82      13.59       7.44
    Maximum           174.82     199.64     196.97     431.65
    Average            47.75      25.63      27.81     177.04
    Standard dev.      13.91      13.90      17.67      48.35
    Coeff. of var.     0.291      0.542      0.635      0.273

    (c) Critical section (including the blocking)

Table 4.2 - Statistics about the two sections of the program


[Figures 4.11 and 4.12: time (ms.) versus number of processes]

Figure 4.11 - Mean time spent in the critical section (waiting time excluded)

Figure 4.12 - Mean time spent in the critical section (waiting time included)


We note that, while a process does not suffer a very important delay (before the
critical section) in the parallel implementation with k = 3 and 6 processes, Figure 4.12
shows a very sharp increase in the waiting time for k = 12. In fact, further results
obtained by tracing the execution of the program showed that, in the parallel
implementation with 12 processes, the queue to the critical section contained almost
always 6 or more processes (not counting the process executing the critical section). This
means that there has almost always been at least one processor idle among the 7
processors available. The fact that the processes are never competing for a processor
can, therefore, explain the steady decrease of the execution times presented in
Figures 4.10 and 4.11. In both cases a first approximation can be obtained in the form
$a + \frac{1}{k} b$, for some appropriate constants a and b. However, since it will be useful in
Section 5, we develop below a closer approximation which takes into account the policy of
Hydra to allocate first a PDP-11/40 (i.e., a faster processor).


Let $p_1$ and $p_2$ be the number of PDP-11/20 and PDP-11/40 available, respectively,
and let $p = p_1 + p_2$. We denote by $\rho$ the relative speeds of the two types of processors;
experimental evidence, from the results of Section 4.2.1, showed that $\rho \approx 1.4$ corresponds
to a reasonable estimate in the particular case of our problem. Consider a program which
requires an average time $x$ when it is executed on a PDP-11/40, and let $x_k$ be the average
execution time of the same program when it is executed in an environment with k
processes (each process is assumed to receive its fair share of computing power). Firstly,
when $k \le p_2$, a PDP-11/40 is allocated to the process, and its actual execution time is,
therefore, simply given by:

$$x_k = x \quad \text{if } k \le p_2 . \qquad (4.2)$$

Next, assume that $p_2 < k \le p = p_1 + p_2$. In this case, the process is allocated a PDP-11/40
the fraction $p_2/k$ of the time, and it is allocated a PDP-11/20 the fraction $(k - p_2)/k$ of the
time. This means that 1 unit of actual execution time contributes to
$\frac{p_2}{k} + \frac{k - p_2}{\rho k}$ units of (PDP-11/40) time toward the total time $x$. We then have:

$$x_k = \frac{\rho k}{k - p_2 + \rho p_2}\, x \quad \text{if } p_2 < k \le p = p_1 + p_2 . \qquad (4.3)$$

Lastly, if $k > p = p_1 + p_2$, let us assume, as it is evidenced in the experiments, that the
processes are not in competition for a processor (i.e., at least k-p processes are always
waiting for entering the critical section). With the same argument as above, we find, in
this case, that:

$$x_k = \frac{\rho p}{p_1 + \rho p_2}\, x \quad \text{if } k > p = p_1 + p_2 . \qquad (4.4)$$

This shows that, in each of the three cases, the average execution time can be
expressed as:

$$x_k = f_k\, x ,$$

where the factor $f_k$ is deduced from equations (4.2), (4.3) and (4.4).


We can now find an approximation in the form $(a + \frac{1}{k} b) f_k$ for the average
execution times of the evaluation section and of the critical section in the implementation
with $k$ processes (denoted by $t_k$ and $c_k$, respectively). We determine the values $a$
and $b$ using a least squares approximation to the values in Table 4.2 (a) and (b). We find
that:

$$t_k = \left( 82.89 + 1207.73\, \frac{1}{k} \right) f_k, \tag{4.5}$$

$$c_k = \left( 7.972 + 39.907\, \frac{1}{k} \right) f_k. \tag{4.6}$$

Using $p_1 = 5$ and $p_2 = 5$ (and $\rho = 1.4$) in the evaluation of the factor $f_k$, we find
that, for $k$ = 1, 3, 6 and 12, the values obtained from equations (4.5) and (4.6) are
consistently within 15% of the experimental results. In addition, these two equations
provide us with some estimates for $t_k$ and $c_k$ which are a useful complement to the
values of Table 4.2, for other values of $k$.
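
The estimates (4.2) to (4.6) are easy to evaluate mechanically. The following is a minimal Python sketch, not code from the thesis: the function names `speed_factor`, `evaluation_time` and `critical_time` are hypothetical, and the processor counts passed in the usage lines are only the values quoted in the text above.

```python
def speed_factor(k, p1, p2, rho):
    """Factor f_k of equations (4.2)-(4.4), so that x_k = f_k * x.

    p1, p2: number of slow (PDP-11/20) and fast (PDP-11/40) processors.
    rho:    relative speed of the fast processor with respect to the slow one.
    """
    p = p1 + p2
    if k <= p2:                        # (4.2): the process always gets a fast processor
        return 1.0
    if k <= p:                         # (4.3): time is split between both processor types
        return rho * k / (k - p2 + rho * p2)
    return rho * p / (p1 + rho * p2)   # (4.4): only p processes compete at any time


def evaluation_time(k, p1, p2, rho=1.4):
    """Estimate (4.5) of the mean evaluation-section time t_k, in ms."""
    return (82.89 + 1207.73 / k) * speed_factor(k, p1, p2, rho)


def critical_time(k, p1, p2, rho=1.4):
    """Estimate (4.6) of the mean critical-section time c_k, in ms."""
    return (7.972 + 39.907 / k) * speed_factor(k, p1, p2, rho)


# Usage: tabulate the estimates for the values of k used in the text.
for k in (1, 3, 6, 12):
    print(k, round(evaluation_time(k, 5, 5), 2), round(critical_time(k, 5, 5), 2))
```

Note that (4.3) and (4.4) agree at $k = p$, so the factor is continuous across the two regimes.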


4.3  - Global  results 

In  this  section,  we  report  the  global  measurements  of  the  parallel  implementations 
with $k$ processes of the iterative methods that we have presented in Section 3. Jacobi's,
the AJ and the AGS methods have been implemented on C.mmp with a configuration of
$p = 6$ processors (4 PDP-11/20 and 2 PDP-11/40), and all the experiments have been run
with $k$ = 1, 2, 3, 4, 6, 7, 8, 9, 12 and 14 processes. The PA method was only
implemented later, by Raskin [48], on Cm* [59] (along with the first three methods), and
the results we present below for this method are the results of his measurements. A
comparison between the results of C.mmp and of Cm* for the three other methods showed
a complete agreement, and we have normalized the timings of the PA method so that they
coincide with those of the AGS method for the implementation with 1 process (since, in


Figure 4.13 - Total execution times with Jacobi's, the AJ, the AGS and the PA methods
(horizontal axis: number of processes)

This direct comparison is somewhat "unfair" vis-à-vis Jacobi's and the AJ methods
since we know that, for the particular problem we are considering, Gauss-Seidel's method
is already twice as fast as Jacobi's method. In Figure 4.14, we have reported the relative

Figure 4.14 - Relative improvements with Jacobi's, the AJ, the AGS and the PA methods
(horizontal axis: number of processes)

Figure 4.14 shows clearly the effects of using the different forms of synchronization
in a parallel algorithm. Due to the full synchronization of all processes at each step of
the iteration, Jacobi's method exhibits the worst behavior of all four methods, while the
PA method, which uses no synchronization at all, achieves an almost optimal speed-up.

Although the AJ and AGS methods are very similar in nature, Figure 4.14 shows that
the speed-up ratios achieved by the two methods differ substantially. This difference is
mainly due to the fact that the total number of iterations increases only slightly with the
number of processes for the AJ method, while the increase is more important for the AGS
method. This is illustrated in Figure 4.15 where we have plotted the number, N(k), of
iterations required to solve our system using k processes as a function of k.


Figure 4.15 - Number of iterations required to solve the system
(vertical axis: N(k); horizontal axis: number of processes k)

Figure 4.15 shows that for the AJ, AGS and PA methods N(k) increases regularly
(and almost linearly) with k. This difference with respect to the sequential method
(Jacobi's or Gauss-Seidel's method) is one of the factors that determine the total running
time of the various methods, but, obviously, the presence (or absence) of synchronization
is another important factor. When the number of processes increases, a critical section,
for instance, acts as a bottleneck, which tends to decrease the parallelism and increase the
total execution time. In the next section, we proceed to the evaluation of this factor.

5 - On the analysis of algorithms for asynchronous multiprocessors

We want to illustrate in this section that the analysis of parallel algorithms for
asynchronous multiprocessors can benefit from techniques developed in the framework of
other general theories. We show that some simple results of order statistics (see, for
example, [14]) and of queueing theory (see, for example, [33]) can be used effectively in
the analysis of algorithms for multiprocessors.

As examples of multiprocessor algorithms, we use in this section some of the
asynchronous iterative methods described in Section 3. We use the parallel
implementation of Jacobi's method (Section 3.1) as a typical example of a synchronized
algorithm, and we use the AJ and AGS methods (Sections 3.2 and 3.3) as typical examples of
asynchronous algorithms in which communication takes place through the use of a critical
section.

The performance of an asynchronous iteration depends principally on two factors.
The number of iteration steps required to solve the system of equations within some
given admissible error $\epsilon$ is one of the important factors which determine the
global running time of an iterative method. This number can be derived
through the tools of numerical analysis, and we will not be concerned with its evaluation
in this section. We will simply use the empirical results observed in the experiments
themselves. (Upper bounds on the number of iteration steps for various asynchronous
iterative methods have been derived in Section 6 of Chapter III. In the case of Jacobi's
method, the exact number of iterations can, in fact, be derived from the theory.) The
(average) time for each process to execute a complete cycle (i.e., from the instant it starts
an evaluation to the instant it starts the next evaluation) is another important factor
contributing to the global running time. This factor is evaluated in the present section.

We assume throughout that the execution times for the evaluation section by all $k$
processes are independent identically distributed random variables distributed according
to the probability distribution $F_k$, associated with the density function $f_k$. Let $t_k$
and $\sigma_k^2$ denote their mean and variance, respectively. Similarly, we assume that the
execution times for the critical section by all $k$ processes are independent identically
distributed random variables distributed according to the probability distribution $G_k$,
associated with the density function $g_k$. Let $c_k$ denote their mean. Estimates for the
quantities $t_k$ and $c_k$ are given in equations (4.5) and (4.6); an estimate for the
quantity $\sigma_k$ can be derived similarly.

In Section 5.1, we consider Jacobi's method and, in Section 5.2, the AJ and AGS
methods. The results derived in these two sections are compared, in Section 5.3, with the
experimental results.

5.1 - Synchronized algorithms

It  follows  from  our  parallel  implementation  of  Jacobi's  method  that  each  process 
cooperating  in  the  evaluation  of  an  iterate  has  the  cyclic  behavior  depicted  in  the  diagram 
of  Figure  5.1. 


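Evaluation section -> Waiting section -> Waiting section -> Critical section
(part 1: evaluation section and first waiting section; part 2: second waiting section
and critical section; the two parts together form one cycle)

Figure 5.1 - Cyclic pattern of a process with Jacobi's method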

The  first  waiting  section  is  due  to  the  full  synchronization  of  all  processes  at  the  end  of 
the  evaluation  of  an  iterate  and  before  the  evaluation  of  the  next  iterate.  The  second 
waiting  section  is  simply  due  to  the  presence  of  the  critical  section  used  for  updating  and 
reading  the  values  of  the  components  of  the  current  iterate.  (A  process  might  have  to  wait 
if another process is already executing the critical section.) The average time $\tau_k$ to
execute a complete cycle in the parallel implementation with $k$ processes can, therefore,
be decomposed as:

$$\tau_k = a_k + b_k, \tag{5.1}$$

where $a_k$ and $b_k$ are the average execution times for the first and second parts of the
cycle, respectively.

Let us first consider the quantity $a_k$. It corresponds to the largest finishing time of
the evaluation section by the $k$ processes. When $k \le p$, therefore, $a_k$ is simply given
by the average of the maximum of $k$ independent random variables distributed according to
the same probability distribution $F_k$, and we have (see, for example, [14, p. 46]):

$$a_k = \int_0^{\infty} \left[ 1 - F^k(t) \right] dt, \tag{5.2}$$

where, for clarity, the index $k$ has been dropped from $F_k$. Let us examine some
probability distributions $F$ for which analytical results can be derived from
equation (5.2).


(i) Exponential distribution with parameter $\mu = 1/t_k$. Using simple integral
calculus, equation (5.2) yields:

$$a_k = \frac{1}{\mu} \int_0^1 \frac{1 - u^k}{1 - u}\, du
      = \frac{1}{\mu} \sum_{1 \le i \le k} \int_0^1 u^{i-1}\, du
      = \frac{1}{\mu} \sum_{1 \le i \le k} \frac{1}{i}
      = t_k H_k, \tag{5.3}$$

where $H_k$ is the $k$-th harmonic number.

(ii) Uniform distribution over the interval $[t_k - \sigma_k\sqrt{3},\ t_k + \sigma_k\sqrt{3}]$
(i.e., with mean $t_k$ and standard deviation $\sigma_k$). Integration of equation (5.2)
yields, in this case (see, for example, [14, p. 27]):

$$a_k = t_k + \frac{k-1}{k+1}\, \sigma_k \sqrt{3}. \tag{5.4}$$

Similar results can be obtained for other probability distributions but unfortunately
they usually cannot be expressed so easily. For most common probability distributions
$F_k$, however, $a_k$ can be shown to be of the form $a_k = t_k + e_k \sigma_k$ (as is the
case in equation (5.4), for example), where the coefficient $e_k$ (which depends on $F_k$)
can be found in many numerical tables. (See, for example, [14, p. 50] for a short table
listing $e_k$ in the case of the normal and the uniform distributions.)
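
The closed forms (5.3) and (5.4) can be checked against a direct numerical integration of equation (5.2). The sketch below is only an illustration: the helper `expected_max` and the values of `t`, `sigma` and `k` are hypothetical and are not measurements from the thesis. The uniform case assumes that the lower end of the interval is non-negative, so that integrating from 0 is legitimate.

```python
import math


def expected_max(cdf, k, upper, steps=200_000):
    """Trapezoidal estimate of (5.2): the integral of 1 - F(t)^k over [0, upper]."""
    h = upper / steps
    total = 0.5 * ((1 - cdf(0.0) ** k) + (1 - cdf(upper) ** k))
    for j in range(1, steps):
        total += 1 - cdf(j * h) ** k
    return total * h


def harmonic(k):
    """k-th harmonic number H_k."""
    return sum(1.0 / i for i in range(1, k + 1))


t, sigma, k = 100.0, 20.0, 6   # hypothetical mean, standard deviation, process count

# (i) Exponential distribution with mean t; closed form (5.3) is t * H_k.
exp_cdf = lambda x: 1.0 - math.exp(-x / t)
a_exp_numeric = expected_max(exp_cdf, k, upper=50 * t)   # tail beyond 50*t is negligible
a_exp_closed = t * harmonic(k)

# (ii) Uniform distribution on [lo, hi]; closed form (5.4).
lo, hi = t - sigma * math.sqrt(3), t + sigma * math.sqrt(3)
uni_cdf = lambda x: min(1.0, max(0.0, (x - lo) / (hi - lo)))
a_uni_numeric = expected_max(uni_cdf, k, upper=hi)
a_uni_closed = t + (k - 1) / (k + 1) * sigma * math.sqrt(3)

print(a_exp_numeric, a_exp_closed)
print(a_uni_numeric, a_uni_closed)
```

The same `expected_max` helper can be reused with any other distribution function for which no closed form is available.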


When $k > p$, the quantity $a_k$ cannot be obtained directly from equation (5.2) since, as
long as $i$ processes, with $p < i \le k$, have not completed their evaluation sections, they
are in competition for the $p$ processors available, and they are, therefore, slowed down by
the factor $\frac{i}{p}$. Let $x_i$, for $1 \le i \le k$, be the $i$-th smallest execution time
required by the $k$ processes. The first process to complete its evaluation section has to
share the $p$ processors with the remaining $k-1$ processes during its entire execution. It
finishes therefore after a time $y_1 = \frac{k}{p} x_1$. Similarly, the second process to
complete its evaluation section finishes after a time
$y_2 = y_1 + \frac{k-1}{p} (x_2 - x_1)$. Continuing in the same way, the last process to
complete its evaluation section finishes after a time:

$$a_k = \frac{k}{p} x_1 + \frac{k-1}{p} (x_2 - x_1) + \cdots
      + \frac{p+1}{p} (x_{k-p} - x_{k-p-1}) + (x_k - x_{k-p}). \tag{5.5}$$

The quantities $x_i$, for $1 \le i \le k$, can be evaluated directly from the distribution
function $F_k$, and we have (see, for example, [14, p. 25]):

$$x_i = k \binom{k-1}{i-1} \int_{-\infty}^{+\infty} t\, F^{i-1}(t)\, [1 - F(t)]^{k-i}\, dF(t), \tag{5.6}$$

where, for clarity, the index $k$ has been dropped from $F_k$. Again, $x_i$ can be evaluated
explicitly for some distribution functions. In particular, we have the following results.

(i) Exponential distribution with parameter $\mu = 1/t_k$. Integrating equation (5.6) by
parts and solving a recurrence relation, we find that:

$$x_i = \frac{1}{\mu} \sum_{k-i+1 \le j \le k} \frac{1}{j} = t_k \left( H_k - H_{k-i} \right),$$

where $H_0$ is defined to be 0. We deduce immediately from equation (5.5) that:

$$a_k = t_k \left[ \frac{k-p}{p} + H_p \right].$$

(ii) Uniform distribution over the interval $[t_k - \sigma_k\sqrt{3},\ t_k + \sigma_k\sqrt{3}]$.
From [14, p. 27], we obtain:

$$x_i = t_k + \frac{2i - k - 1}{k+1}\, \sigma_k \sqrt{3}.$$

We deduce immediately from equation (5.5) that, in this case:

$$a_k = \frac{k}{p}\, t_k + \frac{p-1}{k+1}\, \sigma_k \sqrt{3}.$$

Again, for other probability distributions $F_k$, equation (5.6) can always be integrated
numerically, and, for most probability distributions, numerical tables are available (see,
for example, [60] for the normal distribution).
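
As a cross-check of the case $k > p$, equation (5.5) can be evaluated term by term from the mean order statistics $x_i$ and compared with the two closed forms. This is a minimal sketch under the assumptions of the text; the function name `finishing_time` and the numerical values are hypothetical.

```python
def harmonic(n):
    """n-th harmonic number, with H_0 = 0."""
    return sum(1.0 / j for j in range(1, n + 1))


def finishing_time(x, p):
    """Equation (5.5). x[i-1] is the mean i-th smallest execution time; k = len(x) > p."""
    k = len(x)
    a, prev = 0.0, 0.0
    for i in range(1, k - p + 1):
        # While k-i+1 > p processes are still evaluating, each is slowed by (k-i+1)/p.
        a += (k - i + 1) / p * (x[i - 1] - prev)
        prev = x[i - 1]
    # Once only p processes remain, they run at full speed until the last one finishes.
    return a + (x[k - 1] - prev)


t, sigma, k, p = 100.0, 20.0, 12, 6   # hypothetical values

# Mean order statistics for the exponential and the uniform distributions.
x_exp = [t * (harmonic(k) - harmonic(k - i)) for i in range(1, k + 1)]
x_uni = [t + (2 * i - k - 1) / (k + 1) * sigma * 3 ** 0.5 for i in range(1, k + 1)]

a_exp = finishing_time(x_exp, p)
a_uni = finishing_time(x_uni, p)

print(a_exp, t * ((k - p) / p + harmonic(p)))
print(a_uni, k / p * t + (p - 1) / (k + 1) * sigma * 3 ** 0.5)
```

Because (5.5) is linear in the $x_i$, it collapses to $a_k = \frac{1}{p} \sum_{1 \le i \le k-p} x_i + x_k$, which is what makes both closed forms short.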


Let us now consider the quantity $b_k$ of equation (5.1). Since all processes will try
to access the critical section at the same time (when the last process completes its
evaluation), $b_k$ is simply given by:

$$b_k = \frac{k+1}{2}\, c_k.$$

Table 5.1 summarizes the results of this section and presents, for $k$ = 1, 3, 6 and 12,
the average time for a complete cycle when the distribution $F_k$ is exponential, normal
and uniform. In these three cases, the parameters $t_k$ and $c_k$ are taken directly from the
estimates derived in Section 4.2.2; $\sigma_k$ has been estimated in the same way. These
results are compared to the results derived from the experiments presented in Section 4.3.
(All timings in the table are given in ms.)


                  k = 1      k = 3      k = 6     k = 12

Exponential:    1338.50    1087.99     923.27     872.88
Normal:         1338.50     694.90     513.73     604.39
Uniform:        1338.50     696.74     511.37     589.70
Experimental:   1327.47     700.20     515.96     629.42

Table 5.1 - The average execution time for a complete cycle with Jacobi's method
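
The exponential row of Table 5.1 can be approximated by combining the estimates of Section 4.2.2 with the results of this section. The sketch below rests on several assumptions that are ours and not stated at this point of the text: the configuration $p_1 = 4$, $p_2 = 2$ of Section 4.3, $\rho = 1.4$, and a second part of the cycle equal to $\frac{k+1}{2} c_k$ (the mean completion time when $k$ processes reach the critical section together). Under these assumptions the sketch comes close to the tabulated values for $k$ = 1, 6 and 12; it does not reproduce the $k = 3$ entry, so it should be read as indicative only. The function names are hypothetical.

```python
def harmonic(n):
    """n-th harmonic number."""
    return sum(1.0 / j for j in range(1, n + 1))


def speed_factor(k, p1, p2, rho):
    """Factor f_k of equations (4.2)-(4.4)."""
    p = p1 + p2
    if k <= p2:
        return 1.0
    if k <= p:
        return rho * k / (k - p2 + rho * p2)
    return rho * p / (p1 + rho * p2)


def jacobi_cycle_exponential(k, p1=4, p2=2, rho=1.4):
    """Mean cycle time (ms) for Jacobi's method with exponential evaluation times."""
    p = p1 + p2
    f = speed_factor(k, p1, p2, rho)
    t = (82.89 + 1207.73 / k) * f            # equation (4.5)
    c = (7.972 + 39.907 / k) * f             # equation (4.6)
    if k <= p:
        a = t * harmonic(k)                  # equation (5.3)
    else:
        a = t * ((k - p) / p + harmonic(p))  # exponential case of (5.5)
    b = (k + 1) / 2 * c                      # assumed form of the second part of the cycle
    return a + b


for k in (1, 6, 12):
    print(k, round(jacobi_cycle_exponential(k), 2))
```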

We notice that the exponential distribution certainly does not predict the
experimental results adequately. A reason for this discrepancy is that the exponential
distribution does not take into account the standard deviation $\sigma_k$, which is a direct
measure of the fluctuations in the execution times of the evaluation section. These
fluctuations have an important role in the case of Jacobi's method since the processes (in
the first part of their cycles) synchronize themselves on the largest execution time. The
results obtained with the normal and uniform distributions, on the other hand, show a fair
agreement with the experimental results; the difference, in this case, is partly due to the
fact that the experiments have not always been run in a consistent manner (for instance,
the results presented in Sections 4.2 and 4.3 have not been obtained with the same number
of processors).

5.2 - Asynchronous algorithms


In the parallel implementations of the AJ and AGS methods, the processes
cooperating in the evaluation of an iterate have the cyclic behavior depicted in Figure 5.2.
In this case, the waiting section is only due to the presence of the critical section.

Evaluation section -> Waiting section -> Critical section
(the three sections together form one cycle)

Figure 5.2 - Cyclic pattern of a process with the AJ and AGS methods

The  parallel  implementation  with  k processes  on  p processors  can  be  modeled  by 
the  queueing  system  of  Figure  5.3. 


(a)  k customers in the whole system: our processes;

(b)  p servers in system (1): the evaluation section;

(c)  1 server in system (2): the critical section;

(d)  with the restriction that at most p servers are
active at the same time in the entire system.

Figure  5.3  - A queueing  system  for  asynchronous  algorithms 

This  queueing  system  has  been  extensively  studied  in  the  case  > p as  a model  of 
time-shared processor [55], [33], when the two probability distributions $F_k$ and $G_k$ are
exponential.  We  show  that  the  results  can  be  extended  to  the  case  kip. 

Let us assume that $F_k$ and $G_k$ are exponential distributions with parameters
$\mu = 1/t_k$ and $\lambda = 1/c_k$, respectively. For $i = 0, 1, \ldots, k$, let $q_i$ be the
steady-state probability that $i$ customers are in system (1) of Figure 5.3 (i.e., $i$
processes are executing their evaluation sections, while $k-i$ processes are ready to
execute the critical section). Let $\pi_0$ denote the probability that no process is
executing the critical section, either because all processes are within their evaluation
sections or, possibly, because no processor is allocated to a process ready to execute the
critical section.

We assume throughout that, if there exist at any time in the entire system $i$
processes, with $i > p$, which are not blocked (waiting for another process to complete the
critical section), each of the $i$ processes receives the same fraction $\frac{p}{i}$ of the
computing power. It follows directly that the probability $\pi_0$ is given by:

Vq  - qf,  * Z q.  . (5.9) 

Theorem  5.1 


Assume  that  kip.  The  average  time  required  lo  execute  a complete  cycle  is 
given by:

$$\tau_k = \frac{k\, c_k}{1 - \pi_0}, \tag{5.10}$$

where $\pi_0$ is the probability that the server of system (2) is idle (i.e., no process is
executing the critical section, although some may be blocked because no processors
are available). If we assume that each process which is not blocked receives an equal
share of the computing power, the probabilities $q_i$, for $i = 0, 1, \ldots, k$, satisfy:

<ti  if  i s ft  , 

(i*l)  fik-i  ifpsisft-1,  (5.11) 

p(k^fik-i^^  H Oiiip-1. 

Proof: 

Equations (5.10) and (5.11) are immediate consequences of simple results of
queueing theory. Equation (5.10) follows directly from Little's formula (see, for example,
[33, p. 17]) by considering the throughput of system (2). Equation (5.11) also follows
directly from the fact that (under the exponential assumption for both $F_k$ and $G_k$) the
system of Figure 5.3 corresponds to a pure birth-death process (see, for example,
[33, p. 89]).  ■
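
The birth-death model behind the theorem can be solved numerically in a few lines. The sketch below is our own reading of the assumptions stated in this section and is not a transcription of equation (5.11): state $i$ counts the processes in their evaluation sections, the processes that are not blocked are those $i$ plus one process at the critical section whenever $i < k$, and each of them receives the share $\min(1, p/\text{active})$ of a processor. The function name `cycle_time` and the transition rates are therefore assumptions.

```python
def cycle_time(k, p, t, c):
    """Mean cycle time of k processes on p processors under the assumed birth-death model.

    t, c: mean evaluation-section and critical-section times (exponential).
    Returns (cycle time, list of steady-state probabilities q_0 .. q_k).
    """
    mu, lam = 1.0 / t, 1.0 / c

    def share(i):
        # Unblocked processes in state i: i evaluating, plus one in the critical
        # section whenever at least one process is waiting for it (i < k).
        active = i + (1 if i < k else 0)
        return min(1.0, p / active)

    # Unnormalised weights from the balance equations q_i * up(i) = q_{i+1} * down(i+1).
    weights = [1.0]
    for i in range(k):
        up = lam * share(i)                  # a critical section completes: i -> i+1
        down = (i + 1) * mu * share(i + 1)   # an evaluation completes: i+1 -> i
        weights.append(weights[-1] * up / down)
    total = sum(weights)
    q = [w / total for w in weights]

    # Probability that the critical section is actually being executed (1 - pi_0).
    busy = sum(q[i] * share(i) for i in range(k))
    # Little's formula on the throughput of system (2): k / tau = busy / c.
    return k * c / busy, q


tau, q = cycle_time(6, 6, 351.04, 18.06)   # hypothetical inputs
print(round(tau, 2), round(1.0 - sum(q[i] for i in range(6)), 4))
```

With a single process the model reduces to $t + c$, which is a convenient sanity check.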

The average execution time for a complete cycle can now be evaluated from the
results of Theorem 5.1 using equation (5.9) and the fact that:

$$q_0 + q_1 + \cdots + q_k = 1.$$

5.3  - A comparison  with  the  experimental  results 

The results of Sections 5.1 and 5.2 provide us with an estimate of the average time
required to execute a complete cycle in the parallel implementation with $k$ processes of
Jacobi's method and of the AJ and AGS methods. In order to evaluate the total running
time $T_k$ for the three methods, we also need some estimate of the number of iterations
required by each of the methods in the parallel implementation with $k$ processes. In the
case of Jacobi's method, this number does not depend on $k$ and can be computed
analytically from the spectral radius, $\rho(B)$, of the Jacobi matrix. In the case of the AJ
and AGS methods, we have simply chosen to take directly the number of iterations observed
in the experiments themselves.

The total running time $T_k$ now follows immediately. The resulting values
are plotted in Figure 5.4, along with the values observed from the experiments. (In the
case of Jacobi's method, the average cycle time is evaluated using a uniform distribution
for $F_k$.)

Figure 5.4 - Experimental and theoretical running times
(vertical axis: time in seconds; horizontal axis: number of processes)


We see that the "theory" matches the actual measurements fairly well, especially in
the case of most interest, i.e., when $k \le p$ (clearly we cannot expect any gain from using
more processes than processors). In particular, if we rely on our model, at least for
$k \le p$, we can compute the optimum value for $k$ (beyond which no gain is obtained), and
we find that

$$k_{opt} = \begin{cases}
14 & \text{for Jacobi's method,} \\
15 & \text{for the AJ method,} \\
12 & \text{for the AGS method.}
\end{cases}$$

6 - Concluding remarks

The actual implementation of parallel algorithms on an asynchronous multiprocessor
has proved to be an invaluable help in providing us with a better understanding of
parallel algorithms, in illustrating some of the notions and concepts associated with those
algorithms, and in supporting some of the assumptions that we have introduced in their
analysis. In particular, the figures of Section 4.2.1 show clearly that the execution time of
a program can hardly be regarded as a constant, and that it is more accurate to consider
this execution time as a random variable distributed according to some probability
distribution. In view of the histograms presented in Figures 4.2 through 4.9, an Erlang or
a normal distribution seems to be a reasonable approximation, in our case, to account for
the fluctuations in the execution times of the programs that we have implemented on
C.mmp.

These experiments also constitute a clear illustration of the advantage of purely
asynchronous algorithms over synchronized algorithms. To give a quantitative evaluation
of the effects of synchronization, assume that it takes 1 unit of time for a process to
perform one step of the iteration (excluding any overhead). Then, it follows from the
results we have presented that, in a parallel implementation with 6 processes, it will take
each process an average of about 1.05, 1.62 and 2.34 units of time with the PA, the AJ and
Jacobi's methods, respectively, to perform the same step of the iteration (for that matter,
both the AJ and the AGS methods have the same behavior). While the overhead in the PA
method (about 5%) is mainly due to memory contention, the overheads in the AJ and
Jacobi's methods measure almost directly the effects of using critical sections and of using
full synchronization between the processes, respectively.

In addition to the experiments reported in this chapter, we have also run some other
experiments to consider the effect of the introduction of a relaxation factor in the
different iterative schemes. These results confirmed exactly the simulation results
obtained by Rosenfeld and presented in [52]. In particular, while we are guaranteed of




the convergence of any asynchronous iteration when we use a relaxation factor $\omega$ in
the range $0 < \omega < 2/[1+\rho(B)]$, this is not so when $\omega \ge 2/[1+\rho(B)]$, and
divergence was, indeed, often observed (for the problem that we have considered,
$\rho(B) \approx 0.991$, thus $2/[1+\rho(B)] \approx 1.005$). It would be very useful to
obtain more (experimental or analytical) results on the effects of using relaxation
factors, since our experiments show that (when convergence is achieved) it is a very
promising way to accelerate the iteration.

The results presented in Section 5 are also an interesting aspect of this chapter.
We have shown how simple techniques from order statistics and queueing theory could be
adapted to the analysis of algorithms for asynchronous multiprocessors. The analysis that
we have developed gives a fair account of the experimental results. This is very useful in
practice since it can be used to predict the optimal decomposition of a problem (i.e., the
optimal number of processes to create in order to, for example, minimize the overall
execution time).


Chapter  VI 
Conclusion 


1 - A summary of the results and their implications

An evident advantage of using asynchronous multiprocessors, and parallel computers
in general, rather than conventional uni-processors, is to be able to substantially reduce
the execution time required for solving a problem. Given a particular parallel computer,
therefore, one of the first goals in designing a parallel algorithm for solving a problem is
to try to minimize the required execution time on the given machine. This leads us
naturally to consider the execution time of a parallel algorithm as one of the primary
measures of the performance of the algorithm.

When we consider a sequential algorithm for solving a given problem, say, sorting or
matrix multiplication, the number of comparisons or the number of scalar multiplications
performed by the algorithm is usually used as the measure of complexity of the algorithm.
In this respect, parallel algorithms for SIMD machines are very similar to sequential
algorithms, in the sense that, in this case, the number of parallel instructions (e.g., parallel
comparisons or parallel multiplications) is the usual complexity measure of an algorithm.
The intuitive reason for this cost measure with both sequential algorithms and parallel
algorithms for SIMD machines is that the execution time in these two types of algorithms
is directly related to the number of instructions executed, and that, therefore, it is
realistic to count only those instructions for performance evaluation purposes.

When we are dealing with a parallel algorithm for asynchronous multiprocessors,
however, its non-deterministic behavior contributes to making its analysis drastically
different from the analysis of a sequential algorithm. In particular, there usually does not
seem to exist a direct relation between the (average) execution time of a parallel
algorithm for multiprocessors and the number of instructions executed by each of the
processes. As an illustration, let us examine again Jacobi's method for solving a linear
system of $n$ equations, and consider a parallel implementation with $k$ processes in which
each process evaluates $q = n/k$ components. Let us first choose, as a measure of
performance for this implementation, the number of parallel evaluations of a component
(or, within a factor of $n$, the number of parallel multiplications). The immediate
conclusion, in this case, is that, in order to decrease the cost of the algorithm, we should
always increase the number of processors. Let us now consider directly the total average
time required to perform one step of the iteration with the parallel implementation with
$k$ processes. Assume, as before, that the execution times for the evaluation of $q$
components by all $k$ processes are independent identically distributed random variables
distributed according to an exponential distribution with mean $t_k$. Then, due to the
synchronization between the processes, the total average time for one iteration step is
given by $T_k = t_k H_k$, where $H_k$ is the $k$-th harmonic number. Let us further assume
that $t_k$ is of the form $a + \frac{1}{k} b$ (which is natural in view of our decomposition).
Then, it follows that for large $k$, the total average time $T_k$ grows with $k$ like
$a \ln(k)$ and, thus, increases as the number of processes increases. Therefore, we
conclude, in this case, that there exists a (finite) number $k$ of processes which minimizes
the total average time $T_k$. This is in contradiction with the
conclusion derived from using the other cost measure.
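
The existence of a finite optimum is easy to see numerically. The following sketch evaluates $T_k = (a + b/k) H_k$ over a range of $k$ and returns the minimizer; the constants are borrowed from equation (4.5) purely for illustration, and the function names and the search bound `k_max` are hypothetical.

```python
def harmonic(k):
    """k-th harmonic number H_k."""
    return sum(1.0 / i for i in range(1, k + 1))


def iteration_time(k, a, b):
    """Total average time for one synchronized iteration step: T_k = (a + b/k) * H_k."""
    return (a + b / k) * harmonic(k)


def best_process_count(a, b, k_max=1000):
    """Number of processes in 1..k_max that minimizes T_k (exhaustive search)."""
    return min(range(1, k_max + 1), key=lambda k: iteration_time(k, a, b))


a, b = 82.89, 1207.73            # illustrative constants, taken from equation (4.5)
k_best = best_process_count(a, b)
print(k_best, round(iteration_time(k_best, a, b), 2))
```

The term $b H_k / k$ decreases with $k$ while $a H_k$ grows like $a \ln(k)$, which is why the search always stops at a finite value when $a > 0$.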


This example shows that the analysis of the efficiency of a parallel algorithm for
asynchronous multiprocessors usually requires techniques very different from those
previously developed in the analysis of sequential algorithms or parallel algorithms for
SIMD machines. We think that one of the main contributions of this thesis is to have
presented and used very diverse techniques applicable in the analysis of parallel
algorithms for asynchronous multiprocessors. These techniques are used in various
application areas. The analyses developed in Chapter II Section 5 and in Chapter IV
Section 7.3.1, for instance, are related to some analyses commonly found in Operations
Research, while the treatment of Section 6 of Chapter II applies some techniques typical of
renewal theory. In Chapter III Sections 6 and 8, the complexity of asynchronous iterative
methods is derived using the tools of numerical analysis (this is obviously due to the
nature of the problem treated in this chapter).


We have also presented in Chapter V Section 5 some of the techniques which seem
to be most typical of the analysis of parallel algorithms for multiprocessors, namely
techniques drawn from order statistics and from queueing theory. An important advantage
of this approach is that a large number of results are available from well developed
theories. Most of these results are directly applicable to the analysis of parallel
algorithms for asynchronous multiprocessors, and we have shown, in particular, that a very
simple queueing model (initially intended to represent a time-shared uni-processor)
accounts appropriately for the behavior of an asynchronous parallel algorithm in which
the processes communicate among themselves through the use of a critical section. These
results can be used to predict the optimal decomposition of a problem (i.e., the optimal
number of processes cooperating in the solution of the problem). Some other examples of
the use of queueing theory in the analysis of parallel algorithms for multiprocessors are
also presented in [51] with various applications to sorting algorithms.

A deficiency common to several of the analyses that we have presented is that, in
some cases, strong assumptions must be made in order to be able to carry out the analysis
of an algorithm. In Chapter II Section 5 and in Chapter V Section 5.2, for instance, our
results are based on the assumption that the various execution times are exponentially
distributed. We have observed, however, that whenever we were also able to derive an
analysis of an asynchronous algorithm based on other (more realistic) probability
distributions (see Chapter II Section 6, for instance), the results did not show any
substantial differences with the results derived from the exponential distribution.


Moreover, the analytical results derived in Chapter V Section 5.3 are in excellent
agreement with the experimental results that we have presented in Chapter V. Therefore,
it seems that, although the exponential distribution is not necessarily a very realistic
assumption for the distribution of the execution times, it still provides us with useful
results for asynchronous algorithms. In the case of synchronized algorithms (see
Chapter V Section 5.1), however, analytical results obtained with the exponential
distribution do not agree as well with the experimental results, whereas a
closer approximation is achieved with the normal and the uniform distributions. A reason
for this discrepancy is that the fluctuations are measured directly by the standard
deviation of the probability distribution, and this cannot be captured by the exponential
distribution (for which the standard deviation is always equal to the mean).
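
To make the last remark concrete, the following sketch (an illustration added here, not
taken from Chapter V) compares the expected duration of one fully synchronized step,
modeled as the maximum of k independent execution times, under an exponential distribution
and under a uniform distribution with the same mean. The function names and the numerical
values are assumptions chosen for the example; only the standard closed forms for the
expected maximum of independent samples are used.

```python
import math

def expected_max_exponential(k, mean):
    # E[max of k i.i.d. exponential times] = mean * (1 + 1/2 + ... + 1/k).
    # The standard deviation is forced to equal the mean: there is no
    # separate parameter with which to fit the measured fluctuations.
    return mean * sum(1.0 / i for i in range(1, k + 1))

def expected_max_uniform(k, mean, std):
    # Uniform on [a, b] with the given mean and standard deviation:
    # std = (b - a) / sqrt(12), and E[max of k samples] = a + (b - a) * k / (k + 1).
    width = std * math.sqrt(12.0)
    a = mean - width / 2.0
    return a + width * k / (k + 1)

if __name__ == "__main__":
    k, mean = 4, 1.0            # hypothetical values, for illustration only
    for std in (0.1, 0.3):      # small and moderate fluctuations
        print("uniform, std =", std, "->", expected_max_uniform(k, mean, std))
    print("exponential ->", expected_max_exponential(k, mean))
```

With k = 4 and unit mean, the exponential model predicts a synchronized step of about 2.08
time units whatever the measured fluctuations are, whereas the uniform model gives about
1.10 when the standard deviation is 0.1.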

Another very important aspect of the thesis is to have presented and illustrated
some of the notions and concepts unique to the design of parallel algorithms for
asynchronous multiprocessors. The algorithm proposed in Chapter II, for example,
illustrates the a priori very counter-intuitive idea that the execution of a purely sequential
program can be sped up on an asynchronous multiprocessor without introducing any
parallelism within the program itself. The acceleration is achieved by decomposing the
program into a succession of tasks (executed serially), and by taking advantage of the
fluctuations in the execution times of the tasks. These fluctuations in computing times
represent a dimension unique to the design of parallel algorithms for asynchronous
multiprocessors. Their consequences are twofold. A negative aspect is evidenced by
the example of Jacobi's method presented in the introductory chapter; the net effect, in
this case, is to create a substantial overhead due to the use of a full synchronization of
the processes. The algorithm of Chapter II, on the other hand, demonstrates that the
fluctuations in the computing times can actually be used to accelerate the execution of a
program. Although we do not feel that the algorithm in this chapter should be used
directly as it is presented, we think that the idea embedded in the algorithm can be used
together with other considerations, such as reliability, in the construction of asynchronous
algorithms. Probably the most important aspect of the algorithm presented in Chapter II is
that it illustrates the fact that innovations are required for the design of parallel
algorithms for asynchronous multiprocessors.

The experimental results presented in Chapter V are fundamental in the thesis.
They lend us insight into the behavior of parallel programs executed on an asynchronous
multiprocessor; and, with a better understanding of their behavior, we can expect to be
able to design better parallel algorithms for multiprocessors. In addition, they have been
particularly useful in validating some of the assumptions that we have made in our
analyses. These experimental results are important in another practical aspect, namely,
they provide us with a quantitative comparison of the different uses of synchronization.

The results that we have mentioned so far contribute directly toward the general
goal of the thesis: the design and analysis of parallel algorithms for asynchronous
multiprocessors. Some of the results of the thesis seem to be of theoretical and practical
importance in their own right.

In Chapter III, for instance, we have introduced the class of asynchronous iterative
methods to remove the need for synchronization in the implementation of iterative methods
on a multiprocessor. We think that the results presented in this chapter are a contribution
to the area of iterative methods, and, in particular, they provide some extensions and
generalizations of previously published results [11], [41], [42], [43], [50]. Theorem 4.1,
for example, extends the convergence results obtained by Chazan and Miranker for chaotic
iterations [11], by relaxing a technical condition that they had introduced; furthermore,
our results also provide a generalization to non-linear operators. The results of
Section 5, on the class of asynchronous iterative methods with memory, also generalize
some of the results obtained by Miellou [42].

Chapter IV contains some important results concerning the α-β pruning algorithm.
We have shown in the first part of this chapter that the branching factor of the
α-β pruning algorithm in a uniform game tree of degree n is O(n/ln n), when all bottom
values are assigned independent identically distributed random variables. This confirms a
claim by Knuth and Moore [35] that deep cut-offs only have a second order effect on the
behavior of the algorithm. The results of the second part constitute the main contribution
of Chapter IV. We have proposed in this part an asynchronous parallel implementation of
the α-β pruning algorithm. Our analysis of the parallel implementation with k processes
shows, rather surprisingly, that the speed-up is larger than k. This implies that the
(sequential) α-β pruning algorithm is not optimal and can be substantially improved upon.
This particular result, which has been obtained very indirectly in the thesis, might find
applications in the area of Artificial Intelligence.

2 - Some  topics  for  future  research 

We certainly do not believe that we have covered in this thesis every possible
aspect of the design and the analysis of algorithms for asynchronous multiprocessors.
Clearly, much research remains to be done in this area, and this section mentions several
topics for future research.

We think that the thesis has clearly illustrated an important characteristic of
algorithms for multiprocessors, namely, the a priori unpredictable behavior in their
execution. This characteristic, therefore, makes it an absolute requirement to consider
very carefully the correctness of parallel algorithms for multiprocessors, and research in
this area would certainly be very useful. We are (personally) convinced that every
algorithm proposed in this thesis performs correctly, and we have also given (we hope)
convincing arguments for their correctness. However, in each case, the proof of
correctness is based on techniques which are, usually, only adequate to the problem at
hand. A formal (and general) theory would certainly be a very useful tool for the design
of algorithms for multiprocessors.


Probably, the greatest emphasis of the thesis has been placed on the analysis of
parallel algorithms for asynchronous multiprocessors, and we have presented (and used)
diverse techniques which appear to be applicable to numerous problems. These
techniques have proved to be effective for the algorithms presented, but we think that most
of them could still be improved upon, in particular with regard to the generality of their
applications. Possible generalizations in this area would include, for instance, the
relaxation of some of the assumptions used in the various analyses that we have
presented. The execution time of an algorithm has been regarded in most of the thesis as
the primary measure of complexity of the algorithm. While this measure is, in fact, of
primary importance in real time applications, other complexity measures should also be
considered. Processor utilization, for example, would be another meaningful measure of
performance, particularly if an asynchronous multiprocessor is used in a multi-user
environment. In this case, it would also be of interest to consider the possibility of
increasing the processor utilization by multiprogramming several programs (for example,
several instances of the same parallel algorithm).

The experiments presented in Chapter V have proved to be an invaluable tool. In
general, direct experimentation on an asynchronous multiprocessor can be very useful,
especially when it is difficult to derive any analytical results. In particular, it would be
very interesting to perform more experiments with asynchronous iterations, for example,
to consider the effects of using a relaxation factor. Other experiments could also be
performed to evaluate some of the adaptive asynchronous iterations described in
Section 3.4.2 of Chapter V.

The parallel implementation that we have proposed for the α-β pruning algorithm
appears to be very efficient when few processes are used, but the maximum speed-up
achievable with this method is typically limited to 5 or 6 even with an infinity of
processes. It does not seem that a direct adaptation of the pruning algorithm into a
parallel algorithm is the best approach to follow, particularly because it is based on a
depth first search, which is inherently sequential. A better approach would probably be
to consider a game tree searching algorithm based on a best first search along with a
preliminary evaluation of the internal nodes.

Lastly, we view this thesis as a first step towards a systematic study of the issues
raised by the design and the analysis of algorithms for asynchronous multiprocessors.
algorithm is presented to illustrate that the fluctuations are not always a negative factor
but can also be utilized advantageously. The algorithm demonstrates the seemingly
counter-intuitive result that the execution of a purely sequential program can still be
accelerated on an asynchronous multiprocessor without introducing any parallelism within
the program itself, but only by taking advantage of the fluctuations in computation times.
Two different parallel implementations of this algorithm are proposed (with and without
critical section), and analyses are presented to measure the speed-up achievable.

In the domain of numerical applications, the class of asynchronous iterative methods
is introduced to remove the need for synchronization in the implementation of iterations
for solving a system of equations on a multiprocessor. This class includes iterations
corresponding to parallel implementations in which the cooperating processes have a
minimum of inter-communication and do not make any use of synchronization. The Purely
Asynchronous method is a typical example. A sufficient condition is established which
guarantees the convergence of any asynchronous iteration. This condition is satisfied for
systems of equations found in numerous practical applications.

Several asynchronous iterations have actually been implemented on an asynchronous
multiprocessor. Experimental results are reported, and they show that the Purely
Asynchronous method achieves an almost optimal speed-up. The experiments constitute an
illustration of the various notions and concepts specific to the design and analysis of
parallel algorithms for asynchronous multiprocessors. It is also shown how simple
techniques drawn from order statistics and queueing theory can be used to predict the
experimental results with a fair accuracy.

The α-β pruning algorithm serves as an example of a non-numerical application in
this thesis. The sequential algorithm is first analyzed, and it is shown that the branching
factor of the α-β pruning algorithm for a uniform game tree of degree n grows with n as
O(n/ln n). This confirms a claim by Knuth and Moore that deep cut-offs only have a
second order effect on the behavior of the algorithm. The results obtained with the
sequential algorithm are then used to derive an efficient parallel implementation of the
α-β pruning algorithm on an asynchronous multiprocessor. An analysis of the parallel
implementation with k processes shows, rather surprisingly, an improvement over the
original algorithm by a factor larger than k.